Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected latency behaviour #78

Closed
Alex031544 opened this issue Nov 11, 2016 · 7 comments
Closed

Unexpected latency behaviour #78

Alex031544 opened this issue Nov 11, 2016 · 7 comments

Comments

@Alex031544
Copy link

Alex031544 commented Nov 11, 2016

Objectives

We make some investigations to find out the dependencies of transport latencies between different nodes. Therefore we made an set up with one sender-receiver-computer (IOM) and one repeater-computer (CPM). The IOM publishes defined data-packages on a constant frequency f with the sending timestamp t1 to the network. The CPM subscribes this, takes the receiving time t2, copies the data and publishes the data on time t3. The IOM subscribes at t4 and logs the data.

Afterwards the transport times between the sender-receiver-computer and the repeater-computer and vice versa shall be calculated. We assume them to be equal (Dt = t2-t1 = t4-t3) and use the formula

Dt = ( (t4-t1) - (t3-t2) ) / 2

which means, that the clocks of both computers don't have to be synchronized.

Problem

We assume the transport time Dt to be independent from the sending frequency f, as long as the network is not overloaded. But our results show an linear increasing transport time Dt on linear decreasing frequencies f (as you can see in the following section). In the same time, our timestamps t1 and t2 show that the sending frequency seems to be correct.

Results

Below you will see the results of our investigation. Information to our set up you will find in the next section. We do some tests with LAN and WLAN. In both cases there are some strange behaviours. Regarding the latency-issue it's on both the same. The ordinate axis is limited to 20 ms for a better readability. Latencies bigger than 20 ms are counted as packages out of bounds. So lets take a look to the WLAN test results:

Configuration with a D-Link DIR 510L (5G Hz WLAN)

cfg02_100hz_100s_01

Configuration with a D-Link DIR 510L (2G4 Hz WLAN)

cfg03_100hz_100s_01

Configuration with a TP-Link M7350 (5G Hz WLAN)

  • 50 Hz
    cfg05_050hz_200s_01
  • 100 Hz
    cfg05_100hz_100s_01
  • 200 Hz
    cfg05_200hz_050s_01

Configuration with a TP-Link M7350 (2G4 Hz WLAN)

  • 50 Hz
    cfg06_050hz_200s_01
  • 100 Hz
    cfg06_100hz_100s_01
  • 200 Hz
    cfg06_200hz_050s_01

For us the results are not as expected. Can you help us to find out the reason for it? Is there something like a fixed buffer or something else we don't see at the moment???

Set up

The computers are Raspberry Pi 2 with Raspbian installed on. We did our investigation with openSplice and FastRTPS. But there are no significant differences related to this behaviour. The ROS2 build is alpha7.

Below you will find the code for this two nodes:

sender-receiver-computer (IOM)

  • iom.cpp
#include "rclcpp/rclcpp.hpp"
#include <string>

#include "test_lat_msg/msg/test_lat_msg.hpp"

// reference time stamp
static std::chrono::steady_clock::time_point t_start;
static FILE *fd_log;

void cb_subscriber(const test_lat_msg::msg::LatMsgCpm::SharedPtr msg)
{
	// calc receive time
	uint64_t t4 = (std::chrono::steady_clock::now()-t_start).count();
	
	// store received data to logfile
	fwrite (&msg->id, sizeof(msg->id), 1, fd_log);
	fwrite (&msg->t1, sizeof(msg->t1), 1, fd_log);
	fwrite (&msg->t2, sizeof(msg->t2), 1, fd_log);
	fwrite (&msg->t3, sizeof(msg->t3), 1, fd_log);
	fwrite (&t4, sizeof(t4), 1, fd_log);

	fwrite (&(msg->payload[0]), sizeof(uint8_t), 32, fd_log);

	fflush(fd_log);
}

int main(int argc, char * argv[])
{
	// aircraft number with default value - 0 should not use as valid value
	int flzNo = 0;
	// frequency of data output [Hz]
	int freq  = 10;
	
	// replace default values if some arguments given
	// first argument: aircraft number
	if( argc > 1)
	{
		flzNo = std::stoi(argv[1]);
	}
	// second argument: frequency of data output
	if( argc > 2)
	{
		freq = std::stoi(argv[2]);
	}

	// create a log-file with a unique name
	...

	// create names for node and topics
	...

	// initialize ROS2
	rclcpp::init(argc, argv);
	// initialize a node
	auto node = rclcpp::node::Node::make_shared(flzNodeName);
	// initialize a puplisher for IOM-messages
	auto publisher = node->create_publisher<test_lat_msg::msg::LatMsgIom>
	  (iomTopicName, rmw_qos_profile_sensor_data);

	auto subscriber = node->create_subscription<test_lat_msg::msg::LatMsgCpm>
	  (cpmTopicName, cb_subscriber, rmw_qos_profile_sensor_data);

	// take a reference time stamp
	t_start = std::chrono::steady_clock::now();

	// declare a publisher message
	test_lat_msg::msg::LatMsgIom pupmsg;
	// add aircraft number as IOM-ID to the puplisher message
	pupmsg.iomid = flzNo;
	// create a payload
	for(uint64_t i=0; i<32; i++)
	{
		pupmsg.payload[i] = i;
	}

	// set loop rate for publishing messages
	rclcpp::WallRate loop_rate(freq);
	// a counter for the message id
	uint64_t cnt = 0;
	while ( rclcpp::ok() )
	{
		pupmsg.id = cnt++;
		// put sending time to the message
		pupmsg.t1 = (std::chrono::steady_clock::now()-t_start).count();
		publisher->publish(pupmsg);

		// iterate callbacks		
		rclcpp::spin_some(node);
		// time control of the loop
		loop_rate.sleep();
	}

	return 0;
}

epeater-computer (CPM)

  • cpm.cpp
#include "rclcpp/rclcpp.hpp"
#include <string>
#include "CpmRelay.h"

int main(int argc, char * argv[]){
	// count of aircrafts or rather count of relays provided by this CPM
	// the default value is 5
	uint8_t flzCnt = 5;
	
	// replace default value if some arguments given
	// first argument: aircraft count
	if( argc > 1){
		flzCnt = std::stoi(argv[1]);
	}

	// initialize ROS2
	rclcpp::init(argc, argv);
	// initialize a node
	auto node = rclcpp::node::Node::make_shared("test_cpm_test");

	// create an instance for each relay
	std::vector<test::CpmRelay*> cpmRelay;
	for(int i=1; i<=flzCnt; i++){
		cpmRelay.push_back(new test::CpmRelay(node, i));
	}

	std::cout << "spin" << std::endl;
	rclcpp::spin(node);

	return 0;
}
  • CpmRelay.h

#ifndef CPMRELAY_H_INCLUDED
#define CPMRELAY_H_INCLUDED

#include "rclcpp/rclcpp.hpp"
#include <string>

#include "test_lat_msg/msg/test_lat_msg.hpp"

namespace test
{

class CpmRelay
{

private:
	uint8_t flzNo;
	std::string cpmTopicName;
	std::string iomTopicName;

	rclcpp::node::Node::SharedPtr node;
	
	rclcpp::publisher::Publisher<test_lat_msg::msg::LatMsgCpm>::SharedPtr
		publisher;
	
	rclcpp::subscription::Subscription<test_lat_msg::msg::LatMsgIom>::SharedPtr
		subscriber;

	test_lat_msg::msg::LatMsgCpm pupmsg;
	
	/// offset-time / start time
	std::chrono::steady_clock::time_point t_start;

public:
	CpmRelay( rclcpp::node::Node::SharedPtr node, uint8_t flzNo );

private:
	void cb_subscriber(const test_lat_msg::msg::LatMsgIom::SharedPtr msg);
};

}

#endif // CPMRELAY_H_INCLUDED
  • CpmRelay.cpp
#include "rclcpp/rclcpp.hpp"
#include <string>

#include "test_lat_msg/msg/test_lat_msg.hpp"

#include "CpmRelay.h"

test::CpmRelay::CpmRelay( rclcpp::node::Node::SharedPtr pnode, 
                          uint8_t pflzNo)
{ 
	this->flzNo = pflzNo;
	this->node = pnode;

	// get start time
	this->t_start = std::chrono::steady_clock::now();

	cpmTopicName  = "t_flz" + std::to_string(flzNo) + "_cpm";
	iomTopicName  = "t_flz" + std::to_string(flzNo) + "_iom";
	
	std::cout << "Relay no. " << std::to_string(flzNo) << " listen to " 
	          << iomTopicName << " and writes to " << cpmTopicName << std::endl;

	publisher = node->create_publisher<test_lat_msg::msg::LatMsgCpm>
	  (cpmTopicName, rmw_qos_profile_sensor_data);

	using namespace std::placeholders; // for `_1`
	subscriber = node->create_subscription<test_lat_msg::msg::LatMsgIom>
	  (iomTopicName, 
	   std::bind(&CpmRelay::cb_subscriber, this,_1),
	   rmw_qos_profile_sensor_data);
}

void
test::CpmRelay::cb_subscriber(const test_lat_msg::msg::LatMsgIom::SharedPtr msg)
{
	// get relative receiving time
	pupmsg.t2 = (std::chrono::steady_clock::now()-t_start).count();

	// copy data
	pupmsg.iomid = msg->iomid;
	pupmsg.id = msg->id;
	pupmsg.t1 = msg->t1;
	pupmsg.payload = msg->payload;

	// get relative sending time
	pupmsg.t3 = (std::chrono::steady_clock::now()-t_start).count();
	publisher->publish(pupmsg);
}

Feel free to ask for further information.

Kind regards,
Alex

@tfoote
Copy link

tfoote commented Nov 12, 2016

At first glance you're using spin_some in the publish loop and then sleeping until the next publish. I would expect that you then are only giving rclcpp the opportunity to check for incoming messages each time you publish, which thus causes your latency to be correlated to your publish frequency.

I'd suggest refactoring to make sure to have spin active all the time by putting it in a dedicated thread. Or even simpler would be to setup a timer to trigger your publish events and then you can just call spin in the main thread.

@Alex031544
Copy link
Author

Hello,

thanks for your replay. This makes sense and was very helpful. Now the latencies are constant over the frequency. This is totally what we expected. But there is a new observation: During the transmission data losses occur over a longer time. This leak time increases with the frequency and occurs only with WLAN connections. Have you an idea why this happens?

Obeservations

D-Link DIR-645 (LAN)

  • 25 Hz
    cfg01_025hz_400s_01
  • 50 Hz
    cfg01_050hz_200s_01
  • 100 Hz
    cfg01_100hz_100s_01
  • 200 Hz
    cfg01_200hz_050s_01
  • 400 Hz
    cfg01_400hz_025s_01

TP-Link M7350 (5 GHz WLAN)

  • 25 Hz
    cfg05_025hz_400s_01
  • 50 Hz
    cfg05_050hz_200s_01
  • 100 Hz
    cfg05_100hz_100s_01
  • 200 Hz
    cfg05_200hz_050s_01
  • 400 Hz
    cfg05_400hz_025s_01

TP-Link M7350 (2G4 Hz WLAN)

  • 25 Hz
    cfg06_025hz_400s_01
  • 50 Hz
    cfg06_050hz_200s_01
  • 100 Hz
    cfg06_100hz_100s_01
  • 200 Hz
    cfg06_200hz_050s_01
  • 400 Hz
    cfg06_400hz_025s_01

Setup

Now I seperate the sender and receiver tasks in two classes. The sender has an own thread for his work, while the receiver only bind the callback function to the topic. The spin comes directly before the class-destructors inside the main function:

    // initialize ROS2
    rclcpp::init(argc, argv);
    // initialize a node
    auto node = rclcpp::node::Node::make_shared(flzNodeName);

    // take a reference time stamp
    auto t_start = std::chrono::steady_clock::now();

    auto *iomReceiver = new test::IomReceiver( node, flzNo, t_start );
    auto *iomSender = new test::IomSender( node, flzNo, t_start, freq );

    rclcpp::spin(node);

    delete iomReceiver;
    delete iomSender;

The QoS-profile is again rmw_qos_profile_sensor_data.

@dirk-thomas
Copy link
Member

When using best-effort I would expect messages to be lost when being sent over a wireless connection. If packet collision happens the message will not be resend and therefore never be received. If you switch to reliable that should not happen.

@Alex031544
Copy link
Author

Alex031544 commented Nov 14, 2016

Yes this is clear. But what I don't see why this happens in that way. The network has totally only this two "members" and only this data exchange. Over a longer period with some hundred messaged all goes perfectly. And then comes a short period of collisions? Why, all of a sudden, collisions? Maybe there is something black magic from the router?

To solve this by setting the profile to be reliable truly is a way. But on our 400 Hz measurement this will lead to hold maybe up to 80 messages. Is it right to say this will additionally increase the data traffic? If it is so I think maybe it is a good thing to think about the reasons for this. Because the bandwidth is limited and we go to have up to 10 IOMs in a further development step.

@dirk-thomas
Copy link
Member

You might want to run wireshark on that interface and look at the exchanged packets. A lot of stuff is going on in a (wireless) network even if it is "idle" from the user point of view.

@Alex031544
Copy link
Author

Alex031544 commented Nov 14, 2016

Yeah - I think we should examine this. Sad, that the small modern mobile routers don't support a good QoS-set-up to avoid disturbing parallel communications. (DIR-645 <> DIR510L / TP-Link M7350)

On a DIR-645 it's possible to set up priorities and it seems to lead to better results.

@mikaelarguedas
Copy link
Member

@Alex031544 I'm going to close this due to the low activity for a while. If you want to share the results of your investigation, please feel free to comment here and we can reopen this if needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants