Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High check_frequency values for BufferClient break tf2 #178

Closed
g-gemignani opened this issue Jun 1, 2016 · 16 comments
Closed

High check_frequency values for BufferClient break tf2 #178

g-gemignani opened this issue Jun 1, 2016 · 16 comments
Labels

Comments

@g-gemignani
Copy link
Contributor

Error:

Traceback (most recent call last):
  File "./tf_breaker.py", line 19, in <module>
    timeout=rospy.Duration(1))
  File "/opt/ros/indigo/lib/python2.7/dist-packages/tf2_ros/buffer_client.py", line 66, in lookup_transform
    return self.__process_goal(goal)
  File "/opt/ros/indigo/lib/python2.7/dist-packages/tf2_ros/buffer_client.py", line 123, in __process_goal
    return self.__process_result(self.client.get_result())
  File "/opt/ros/indigo/lib/python2.7/dist-packages/tf2_ros/buffer_client.py", line 127, in __process_result
    raise tf2.TransformException("The BufferServer returned None for result or result.error!  Something is likely wrong with the server.")
tf2.TransformException: The BufferServer returned None for result or result.error!  Something is likely wrong with the server.

How to reproduce it:

#!/usr/bin/env python2

import tf2_ros
import rospy

if __name__ == '__main__':
    rospy.init_node('tf_breaker')
    bc = tf2_ros.BufferClient("/tf2_buffer_server", check_frequency=1000)
    i = 0
    while True:
        time_req = rospy.Time.now()
        bc.can_transform('base_footprint',
                         'map',
                         time_req,
                         rospy.Duration())
        ts = bc.lookup_transform('base_footprint',
                                 'map',
                                 time_req,
                                 timeout=rospy.Duration(1))
        rospy.loginfo('tf lookup number %d worked!', i)
        i += 1
@tfoote
Copy link
Member

tfoote commented Jun 1, 2016

I can confirm that I saw this error once but cannot reproduce it again.

$ python test.py 
Traceback (most recent call last):
  File "test.py", line 19, in <module>
    timeout=rospy.Duration(1))
  File "/opt/ros/indigo/lib/python2.7/dist-packages/tf2_ros/buffer_client.py", line 66, in lookup_transform
    return self.__process_goal(goal)
  File "/opt/ros/indigo/lib/python2.7/dist-packages/tf2_ros/buffer_client.py", line 123, in __process_goal
    return self.__process_result(self.client.get_result())
  File "/opt/ros/indigo/lib/python2.7/dist-packages/tf2_ros/buffer_client.py", line 127, in __process_result
    raise tf2.TransformException("The BufferServer returned None for result or result.error!  Something is likely wrong with the server.")
tf2.TransformException: The BufferServer returned None for result or result.error!  Something is likely wrong with the server.

On repeated calls I get the expected LookupException

$ python test.py 
Traceback (most recent call last):
  File "test.py", line 19, in <module>
    timeout=rospy.Duration(1))
  File "/opt/ros/indigo/lib/python2.7/dist-packages/tf2_ros/buffer_client.py", line 66, in lookup_transform
    return self.__process_goal(goal)
  File "/opt/ros/indigo/lib/python2.7/dist-packages/tf2_ros/buffer_client.py", line 123, in __process_goal
    return self.__process_result(self.client.get_result())
  File "/opt/ros/indigo/lib/python2.7/dist-packages/tf2_ros/buffer_client.py", line 130, in __process_result
    raise tf2.LookupException(result.error.error_string)
tf2.LookupException: "base_footprint" passed to lookupTransform argument target_frame does not exist.

Cranking the rate up to 100000 I was also able to get a double answer:

$ python test.py 
[ERROR] [WallTime: 1464801977.428899] Received comm state ACTIVE when in simple state DONE with SimpleActionClient in NS /tf2_buffer_server
[ERROR] [WallTime: 1464801978.438880] SimpleActionClient received DONE twice
Traceback (most recent call last):
  File "test.py", line 19, in <module>
    timeout=rospy.Duration(1))
  File "/opt/ros/indigo/lib/python2.7/dist-packages/tf2_ros/buffer_client.py", line 66, in lookup_transform
    return self.__process_goal(goal)
  File "/opt/ros/indigo/lib/python2.7/dist-packages/tf2_ros/buffer_client.py", line 123, in __process_goal
    return self.__process_result(self.client.get_result())
  File "/opt/ros/indigo/lib/python2.7/dist-packages/tf2_ros/buffer_client.py", line 130, in __process_result
    raise tf2.LookupException(result.error.error_string)
tf2.LookupException: "base_footprint" passed to lookupTransform argument target_frame does not exist. 

@g-gemignani Does it fail every time for you what sort of hardware/network setup are you using? I'm currently testing on loopback.

Note for others reproducing you also need to run a buffer_server.

I suspect it's in this block jumping out early: https://github.com/ros/geometry2/blob/indigo-devel/tf2_ros/src/tf2_ros/buffer_client.py#L177-L180 since that's the only block that the rate interacts with assuming it's the rate.

@g-gemignani
Copy link
Contributor Author

@tfoote We always experience this problem on our SCITOS base. We are currently running everything on the base (Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz, 8GB ram). All the traffic is local. I will spend a little bit of time trying to come up with a server that reproduces the error. I'll keep you updated. Thanks! :)

@g-gemignani
Copy link
Contributor Author

@tfoote As promised, here is the package.
Cheers! :)
tf2_test.tar.gz

@tfoote tfoote added the bug label Jul 10, 2017
@tfoote tfoote mentioned this issue Jul 13, 2017
@tfoote
Copy link
Member

tfoote commented Jul 13, 2017

@g-gemignani Thanks for the simple reproducable example.

After a bunch of digging I believe that what's happening is that there's a

I did some debugging into this and I definitely can see the issue. However I haven't yet traced down the source.

My current speculation is that this is actually a race condition inside the simple action client usage. In particular whether the get_result() and get_state() calls. This range The only thing between them before was the sleep for the rate. Adding a sleep right before the get_result() call makes almost all the errors go away. Both the simple_action_client errors as well as the

@mikaelarguedas FYI

@jorgenfb
Copy link
Contributor

We are also experiencing this issue. Is it possible to work around this race condition in the buffer client by adding a check inside the __is_done() function? Require a result if the state is SUCCEEDED?

    def __is_done(self, state):
        if state == GoalStatus.REJECTED or state == GoalStatus.ABORTED or \
           state == GoalStatus.RECALLED or state == GoalStatus.PREEMPTED or \
           state == GoalStatus.LOST:
            return True
        if state == GoalStatus.SUCCEEDED and self.client.get_result():
            return True
        return False

@tfoote
Copy link
Member

tfoote commented Oct 3, 2018

@jorgenfb Could you propose the change in a PR for review?

@jorgenfb
Copy link
Contributor

jorgenfb commented Oct 5, 2018

PR is ready. Any chance this can be backported to kinetic?

tfoote pushed a commit that referenced this issue Oct 11, 2018
@tfoote
Copy link
Member

tfoote commented Oct 11, 2018

This doesn't break ABI or API so it should be good to consider for backport. I'll plan to include it next time I do a backport sweep.

I'm going to close this. @g-gemignani if you could verify the patch resolves your issue that would be great. If not please comment back here and we can reopen this.

@tfoote tfoote closed this as completed Oct 11, 2018
tfoote pushed a commit that referenced this issue Oct 19, 2018
tfoote pushed a commit that referenced this issue Oct 19, 2018
@g-gemignani
Copy link
Contributor Author

Hello @tfoote, hello @jorgenfb. I apologize for the late reply. I went on, checked out the pr and reran my test with and without the patch. Without the patch the output given by the node is:

[/tf_breaker INFO 1540117566.835347]: Listening for tfs and waiting for errors...
[/tf_breaker ERROR 1540117569.839715]: tf lookup number 0 failed!
[/tf_breaker ERROR 1540117569.840086]: Lookup would require extrapolation into the past. Requested time 1540117566.835526943 but the earliest data is at time 1540117567.097922086, when looking up transform from frame [base_link] to frame [map]
[/tf_breaker INFO 1540117569.840367]: Listening for tfs and waiting for errors...
[/tf_breaker ERROR 1540117570.007823]: tf lookup number 2 failed!
[/tf_breaker ERROR 1540117570.008253]: The BufferServer returned None for result! Something is likely wrong with the server.
[/tf_breaker INFO 1540117570.008483]: Listening for tfs and waiting for errors...
[/tf_breaker ERROR 1540117570.107324]: tf lookup number 3 failed!
[/tf_breaker ERROR 1540117570.107740]: The BufferServer returned None for result! Something is likely wrong with the server.
[/tf_breaker INFO 1540117570.107966]: Listening for tfs and waiting for errors...
[/tf_breaker INFO 1540117570.807513]: Requested 10 tf lookups. Continuing...
[/tf_breaker ERROR 1540117571.008009]: tf lookup number 12 failed!
[/tf_breaker ERROR 1540117571.008497]: The BufferServer returned None for result! Something is likely wrong with the server.
[/tf_breaker INFO 1540117571.008717]: Listening for tfs and waiting for errors...
[/tf_breaker INFO 1540117571.809085]: Requested 20 tf lookups. Continuing...
[/tf_breaker INFO 1540117572.808300]: Requested 30 tf lookups. Continuing...

with the patch the output is:

[/tf_breaker INFO 1540118132.475942]: Listening for tfs and waiting for errors...
[/tf_breaker ERROR 1540118135.483135]: tf lookup number 0 failed!
[/tf_breaker ERROR 1540118135.484018]: Lookup would require extrapolation into the past. Requested time 1540118132.476082086 but the earliest data is at time 1540118132.730367898, when looking up transform from frame [base_link] to frame [map]
[/tf_breaker INFO 1540118135.484715]: Listening for tfs and waiting for errors...
[/tf_breaker INFO 1540118136.441680]: Requested 10 tf lookups. Continuing...
[/tf_breaker INFO 1540118137.442052]: Requested 20 tf lookups. Continuing...
[/tf_breaker ERROR 1540118138.341125]: Received comm state ACTIVE when in simple state DONE with SimpleActionClient in NS /tf2_buffer_server
[/tf_breaker ERROR 1540118138.341439]: SimpleActionClient received DONE twice
[/tf_breaker INFO 1540118138.441333]: Requested 30 tf lookups. Continuing...
[/tf_breaker ERROR 1540118139.280240]: Received comm state ACTIVE when in simple state DONE with SimpleActionClient in NS /tf2_buffer_server
[/tf_breaker ERROR 1540118139.340648]: SimpleActionClient received DONE twice
[/tf_breaker INFO 1540118139.441378]: Requested 40 tf lookups. Continuing...
[/tf_breaker INFO 1540118140.441591]: Requested 50 tf lookups. Continuing...
[/tf_breaker INFO 1540118141.441462]: Requested 60 tf lookups. Continuing...
[/tf_breaker INFO 1540118142.440996]: Requested 70 tf lookups. Continuing...
[/tf_breaker ERROR 1540118143.145556]: Received comm state ACTIVE when in simple state DONE with SimpleActionClient in NS /tf2_buffer_server
[/tf_breaker ERROR 1540118143.240891]: SimpleActionClient received DONE twice
[/tf_breaker INFO 1540118143.441722]: Requested 80 tf lookups. Continuing...
[/tf_breaker ERROR 1540118143.744696]: Received comm state ACTIVE when in simple state DONE with SimpleActionClient in NS /tf2_buffer_server
[/tf_breaker ERROR 1540118143.840007]: SimpleActionClient received DONE twice
[/tf_breaker INFO 1540118144.441042]: Requested 90 tf lookups. Continuing...
[/tf_breaker INFO 1540118145.441657]: Requested 100 tf lookups. Continuing...
[/tf_breaker INFO 1540118146.441883]: Requested 110 tf lookups. Continuing...
[/tf_breaker INFO 1540118147.441090]: Requested 120 tf lookups. Continuing...
[/tf_breaker ERROR 1540118147.542277]: Received comm state ACTIVE when in simple state DONE with SimpleActionClient in NS /tf2_buffer_server
[/tf_breaker ERROR 1540118147.542585]: SimpleActionClient received DONE twice
[/tf_breaker INFO 1540118148.440971]: Requested 130 tf lookups. Continuing...
[/tf_breaker ERROR 1540118149.146334]: Received comm state ACTIVE when in simple state DONE with SimpleActionClient in NS /tf2_buffer_server
[/tf_breaker ERROR 1540118149.149111]: SimpleActionClient received DONE twice

From this short test I believe that the problem was partially mitigated. In fact, no more tf errors are thrown, however there is still something wrong with the tf state machine. Thank you for the patch @jorgenfb !

@tfoote
Copy link
Member

tfoote commented Oct 22, 2018

@g-gemignani thanks for the updated testing. I'll reopen this to track the remaining issue.

@tfoote tfoote reopened this Oct 22, 2018
@jorgenfb
Copy link
Contributor

I did some further digging. The issue is with the action client, not with BufferClient (state is available before result). I created a simplified case without using bufferclient, which I will post along with an issue in actionlib when I have cleaned it up.

However during my testing I found two solutions that can be used for BufferClient. SimpleActionClient does have a .wait_for_result() method that will prevent this issue . I guess that BufferClient decided to do its own version to be able to speed up the check frequency when needed. Anone know if this is the case?

The other solution is to check the SimpleGoalState instead of the state:
Change this line:

while not rospy.is_shutdown() and not self.__is_done(self.client.get_state()) and not timed_out:

To this one:

while not rospy.is_shutdown() and not self.client.simple_state == SimpleGoalState.DONE and not timed_out:

and remove __is_done() complete.

Do we want to update to the last solution, I can prepare a new PR? And if so, do we close this issue and continue once I have posted an issue in actionlib?

@tfoote
Copy link
Member

tfoote commented Oct 22, 2018

It would be great to move to the actionlib API but you're right that it doesn't support the adjustable timeout. Making that a parameter in the actionlib API would be a great enhancement ticket/PR. In the mean time we should copy the logic internal to that method. https://github.com/ros/actionlib/blob/d317e63ebabc6c437f8b9f73ece06acf348700ea/src/actionlib/simple_action_client.py#L124

Actually on looking closer, internal to the action client the wait is a condition wait, which gets notified when there's new content. Which means that it shouldn't actually need a parameter to do the polling at any specific frequency and it will be better. The looping logic is just implemented to make sure that we don't jump out to early on an invalid wakeup and that the condition wait will also respond to shutdown requests. So I might suggest that we just switch to use wait_for_result and deprecate the now unused argument.

The loop rate could still be parameterized since the 0.1 seconds is a magic number, but that's lower priority for me as it's just the shutdown response time and won't effect runtime result responsiveness.

tfoote pushed a commit that referenced this issue Oct 23, 2018
@jorgenfb
Copy link
Contributor

Using wait_for_result() does indeed look like the best solution. How is the process of doing such a change that would effect the public API (check_frequency parameter is no longer in use)? I'm guessing it's no longer just a bug fix.

I would be happy to provide the actual code changes if you like, but then would need assistance on handling the deprecation.

@tfoote
Copy link
Member

tfoote commented Oct 23, 2018

I would suggest making two changes. The first will simply switch over to using the wait_for_result() and a second would update the constructor to make the default check_frequency None and if the check_frequency is set to anything then print a warning that the that argument is deprecated in and should not be used, and then ignore it. This will give a visible but not spammy warning but not break anyone who calls it.

jorgenfb added a commit to jorgenfb/geometry2 that referenced this issue Oct 27, 2018
…ate check_frequency

Use the API provided by actionlib for waiting for result. This will improve the response time and prevent problems with custom solutions (see ros#178). This change makes constructor parameter check_frequency obsolute and deprecates it.
@jorgenfb
Copy link
Contributor

I have created a pull request for the suggested changes and opened an issue for the original cause of this issue in actionlib repo, including a reproducable test case.

jorgenfb added a commit to jorgenfb/geometry2 that referenced this issue Oct 29, 2018
Use the API provided by actionlib for waiting for result. This will improve the response time and prevent problems with custom solutions (see ros#178). This change makes constructor parameter check_frequency obsolute and deprecates it.
tfoote pushed a commit that referenced this issue Oct 29, 2018
Use the API provided by actionlib for waiting for result. This will improve the response time and prevent problems with custom solutions (see #178). This change makes constructor parameter check_frequency obsolute and deprecates it.
@tfoote
Copy link
Member

tfoote commented Oct 29, 2018

@jorgenfb Thanks for digging in and the follow up. As #337 has been merged I'm going to close this.

@tfoote tfoote closed this as completed Oct 29, 2018
tfoote pushed a commit that referenced this issue Oct 29, 2018
Use the API provided by actionlib for waiting for result. This will improve the response time and prevent problems with custom solutions (see #178). This change makes constructor parameter check_frequency obsolute and deprecates it.
tfoote pushed a commit that referenced this issue Oct 29, 2018
Use the API provided by actionlib for waiting for result. This will improve the response time and prevent problems with custom solutions (see #178). This change makes constructor parameter check_frequency obsolute and deprecates it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants