Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Number of received bytes differ from message length (SmplMsgConnection::receiveMsg). #25

Closed
liborw opened this issue Jul 25, 2013 · 14 comments
Assignees
Labels
Milestone

Comments

@liborw
Copy link
Contributor

liborw commented Jul 25, 2013

When testing the motoman driver, I got this error:

$ rosrun fs100 robot_state
[ INFO] [1374677672.437211315]: Added message handler for message type: 0
[ INFO] [1374677672.440014448]: Robot state connecting to IP address: '192.168.0.101:50241'
[ WARN] [1374677672.440910442]: Unable to find user-specified joint names in 'controller_joint_names'
[ WARN] [1374677672.441766703]: Unable to find URDF joint names in 'robot_description'
[ INFO] [1374677672.441864812]: Using standard 6-DOF joint names: [joint_1, joint_2, joint_3, joint_4, joint_5, joint_6]
[ INFO] [1374677672.442550563]: Connected to server
[ INFO] [1374677672.442607446]: Initializing message manager with default comms fault handler
[ INFO] [1374677672.442657083]: Default communications fault handler successfully initialized
[ INFO] [1374677672.442685145]: Initializing message manager
[ INFO] [1374677672.442726762]: Added message handler for message type: 1
[ INFO] [1374677672.446381376]: Added message handler for message type: 15
[ INFO] [1374677672.447653822]: Entering message manager spin loop
[ERROR] [1374677672.495079875]: Get unload pointer failed, buffer size: 3, smaller than byte size: 4
[ERROR] [1374677672.495149280]: Unload pointer returned NULL
[ERROR] [1374677672.495206461]: Failed to unload message joint: 5 from data[3]
[ERROR] [1374677672.495241827]: Failed to unload joint feedback positions
[ERROR] [1374677672.495278751]: Failed to unload joint feedback message data
[ERROR] [1374677672.495314845]: Failed to initialize joint feedback message
[ERROR] [1374677672.495351389]: Failed to convert SimpleMessage
[ WARN] [1374677672.495409474]: Recieved zero bytes: 0
[ERROR] [1374677672.495445331]: Failed to initialize message
[ERROR] [1374677672.495485843]: Failed to receive incoming message
[ WARN] [1374677672.495522284]: Send failure, no callback support
^C[ INFO] [1374677674.162045284]: Connection failed, attempting reconnect
[ERROR] [1374677674.162110342]: Failed to connect to server, rc: -1, errno: 106
[ WARN] [1374677674.162132291]: Not connected, bytes not sent
[ERROR] [1374677674.162152122]: Failed to receive message length
[ERROR] [1374677674.162164475]: Failed to receive incoming message
[ WARN] [1374677674.162174342]: Send failure, no callback support

When started with DEBUG logging level the ouput.

So, the problem is that the size of the ByteArray returned by receiveBytes is shorter that the expected message length (that should be checked). I have looked on the packets using Wireshark and they seems to be ok.

@gavanderhoorn
Copy link
Member

'looked at the packets using Wireshark': did you use packet-simplemessage?

Could you attach / upload a Wireshark capture somewhere?

@liborw
Copy link
Contributor Author

liborw commented Jul 25, 2013

Yes I have, the file can be downloaded here, but it will expire in one day.

@liborw
Copy link
Contributor Author

liborw commented Jul 25, 2013

I have an update on this, adding a MSG_WAITALL flag to the RECV command in simple_message/src/socket/tcp_socket.cpp:84 solved the problem. Is there a reason not to have this flag there?

@gavanderhoorn
Copy link
Member

Ah, this might then be related to swri-ros-pkg/issues/23.


Yes, the first few simple_message packets arrive over 2 TCP frames (packet 786 in your capture). Seems the deserialisation step is confused by this.

@JeremyZoss
Copy link
Member

I think this assessment is probably correct.

While Libor's fix probably solves this issue in the short-term, I think we should hold off on pushing this change to the public repository. This change causes the socket code to wait forever until the requested bytes are read. In the event of datastream errors, this might lead to long waits with no errors. A better solution would be to properly handle both split and merged simple_message packets in a mechanism that also provides timeout-capable waits. So that we could wait for the expected # of bytes, possibly across multiple packets, but return after a specified timeout period if the expected # of bytes is not received. This is indeed the issue logged in bug #23.

Please close this bug if your issue is resolved.

@gavanderhoorn
Copy link
Member

One idea might be to utilise the following pattern:

  • read all bytes currently in the receive buffer (a single recv(..))
  • copy those bytes to a internal buffer
  • if len(my_buffer) >= nr_of_bytes_needed: handle(my_buffer[:nr_of_bytes_needed])
  • remove bytes from my_buffer
  • goto 'start'

Fairly standard. If there aren't enough bytes, wait for them to come in the next recv(..) (but with a configurable timeout).

@JeremyZoss
Copy link
Member

That's mostly what I was thinking. Here are two possible variations:

  1. only read up to the number of expected bytes (not full receive buffer)
    - this prevents the need to shift bytes in my_buffer
    - but potentially means slightly more reads of the socket buffers (not very expensive)

  2. use SO_RCVTIMEO option on the socket, to automatically enforce the desired recv timeout. This may not be well-supported cross-platform (e.g. DX100 C++ server code), but could be #ifdef controlled to the linux side only. This would be a relatively painless fix, I think.

We should probably log this as an enhancement, or continue discussion on the original bug, since I don't know that the current discussion is directly relevant to Libor's error messages.

@shaun-edwards
Copy link
Member

This is probably a good solution, but how would we test it? Has anybody seen this error before? Is there anything special about Libor's setup that would cause this? Perhaps network topology? We've always communicated with the robot on a small (single switch) network.

@JeremyZoss
Copy link
Member

I think testing would be fairly easy. We could "hand-craft" TCP message packets that are either the combination of two SimpleMessage packets, or split a single message into multiple packets. These packets could be sent (with optional time-delays) to test the merged/split packet handling. I think that with TCP_NODELAY set, all packets should be sent "immediately".

I do remember this being an issue before. I can't remember the details, but I think it might have been Ed Venator using a wireless link to his ABB robot. The wireless delays caused split packets.

@gavanderhoorn
Copy link
Member

If you go the receive timeout way, why not just use a select(..) with O_NONBLOCK on the socket. That is the 'normal' way AFAIK. Again, #ifdefed, as I don't think many of the controllers -- if they support C/C++ -- will support select.


As for the multiple-packets-in-single-TCP-frames or single-packet-over-multiple-TCP-frames: I've been able to reproduce that fairly easily (at least for the Fanuc). Either set a very high update rate on the controller side (so send out a lot of packets/sec), or configure the socket in 'interactive' mode. The latter is probably mapped onto TCP_NODELAY: every write results in a packet. These two result in both extremes.

@gavanderhoorn
Copy link
Member

As for testing: if you mean manually: there are tools that can replay a Wireshark capture, I think I can get some fairly complicated (de)serialisation situations using the technique described above. If you mean automatically: just create some hand-crafted packets (byte buffers) and feed them to the deserialisation routines by hand. That should be doable. (right, JeremyZoss already said that)

@JeremyZoss
Copy link
Member

select-based timeouts would work, and are probably more "standard", as you suggest. But I think the SO_RCVTIMEO method is more elegant (when supported), because it can be dropped into the current implementation with minimal changes. It also allows you to block until the desired # of bytes are read, while select() only considers whether any data is available or not.

In my opinion, select is a cleaner way to do the implementation you originally suggested (loop over recv(NOWAIT) until timeout or all data received). This eliminates the need to do a "polling" or "busy-wait" loop waiting for data to arrive. The select() call will block until data (or timeout).

But I still think the SO_RCVTIMEO option is more elegant, if it works. That's just my vote, though. I'm probably happy to leave it up to whoever feels motivated to do the implementation. =)

@ghost ghost assigned shaun-edwards Jul 25, 2013
@shaun-edwards
Copy link
Member

Another related issue (google code): https://code.google.com/p/swri-ros-pkg/issues/detail?id=55

shaun-edwards added a commit that referenced this issue Aug 8, 2013
…ecuted in parallel with colliding on the same ports
@shaun-edwards
Copy link
Member

Fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

4 participants