RasperryPi3 Wireless communication problem #315

Closed
MiguelCompany opened this issue Sep 10, 2019 · 16 comments
Labels
bug Something isn't working

Comments

@MiguelCompany
Contributor

Note: This was initially reported by @codebot by e-mail to eProsima. It is posted here following a suggestion by @dirk-thomas

Bug report

Required Info:

  • Operating System:
    • Ubuntu 18.04
  • Installation type:
    • binaries
  • Version or commit hash:
    • dashing (ros-dashing-rmw-fastrtps-cpp is 0.7.5-1bionic)
  • DDS implementation:
    • Fast-RTPS

Steps to reproduce issue

# On Machine 1 (x86 workstation with Ubuntu 18.04 and ROS 2 Dashing)
ros2 run examples_rclpy_minimal_publisher publisher_old_school

# On Machine 2 (Raspberry Pi 3 with Ubuntu 18.04 and ROS 2 Dashing)
ros2 run examples_rclpy_minimal_subscriber subscriber_old_school

Expected behavior

Strings published from Machine 1 start printing on the console of Machine 2, after allowing a few seconds for discovery and connection.

Actual behavior

Usually, nothing is printed on the console of Machine 2. If left to run for a long time, sometimes after ~3 to ~5 minutes a few strings will print to the console intermittently, but usually (probably 90-99% of the time) it is not receiving messages. I don't know if this is due to WiFi jitter/latency/packet-drops on the RPi3, or the relatively slow CPU and I/O on the RPi3 causing some timeouts to be missed, or what's going on exactly. This behavior appears "sometimes" on laptops on WiFi, but it is more hit-and-miss. With the RPi3 it's typically much easier to reproduce the issue.

Additional information

Router model (in case it matters): TP-Link Archer C4000
Machine 1: connected via (wired) Gigabit Ethernet
Machine 2: connected via the RPi3 built-in WiFi, which is not super awesome

@MiguelCompany
Contributor Author

MiguelCompany commented Sep 10, 2019

At eProsima we haven't been able to reproduce this. We have tested the following environment:

Machines

Machine 1: Raspberry Pi 3 Model B Plus Rev 1.3 with Ubuntu 18.04 and ROS 2 Dashing from binaries

Installed following this guide
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.3 LTS
Release:        18.04
Codename:       bionic

$ uname -a
Linux ubuntu 4.15.0-1044-raspi2 #47-Ubuntu SMP PREEMPT Thu Aug 15 14:11:00 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux

Machine 2: x86 Laptop (Dell XPS 15 9560) with Ubuntu 18.04 and ROS 2 Dashing from binaries

ROS2 installed following official documentation
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic

$ uname -a
Linux suxen 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Machine 3: x86 Laptop (Dell XPS 15 9560) with Windows 10 and ROS 2 Dashing from binaries

ROS2 installed following official documentation

Network connection

Router model: Asus RT-N12E
IGMP Snooping: Tested both ON and OFF with same results
Machine 1: Connected via WiFi
Machine 2: Connected via Gigabit Ethernet cable
Machine 3: Connected via WiFi

Basic test

Commands run

First run the subscribers on the three machines

ros2 run examples_rclpy_minimal_subscriber subscriber_old_school

Then open another terminal and start a publisher on each machine.

ros2 run examples_rclpy_minimal_publisher publisher_old_school

Results

Whenever a publisher was started, all subscribers started receiving messages (the first index was either 0 or 1). After 10 minutes, no messages had been lost, although the subscriber on the RPi sometimes stalled for a while and then showed a burst of received messages.

Additional test 1

On RPi terminal 1, repeat

# Hit Ctrl-C to stop subscriber
ros2 run examples_rclpy_minimal_subscriber subscriber_old_school
# Wait for messages to be received again

Results

  • Most of the time it starts receiving messages from all publishers after less than 5 seconds.
  • Sometimes (1 out of 100) messages from 1 publisher start after 40 seconds.
  • Sometimes (1 out of 100) messages from 2 publishers start after 40 seconds.

Additional test 2

On Ubuntu laptop, terminal 2, repeat:

# Hit Ctrl-C to stop publisher
ros2 run examples_rclpy_minimal_publisher publisher_old_school
# Wait for messages on RPi, terminal 1 to be received again

Results

Most of the time it starts receiving messages after less than 5 seconds.
Sometimes (1 out of 100) messages start after 40 seconds.

Additional test 3

Repeat:

# Remove ethernet cable on Ubuntu laptop
# Wait for 10 seconds
# Plug ethernet cable on Ubuntu laptop
# Wait for messages on RPi to be received again

Results

RPi starts receiving messages after less than 2 seconds.

Additional test 4

Repeat:

# Remove ethernet cable on Ubuntu laptop
# Wait for 2 minutes
# Plug ethernet cable on Ubuntu laptop
# Wait for messages on RPi terminal 1 to be received again

Results

  • Most of the time all terminals start receiving messages after less than 5 seconds.
  • Sometimes (1 out of 100) the RPi starts receiving messages after 40 seconds.

Conclusion

Sometimes the multicast message announcing the presence of a participant is lost, and communication is restored only when the periodic announcement is resent.
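
This conclusion can be illustrated with a simplified timing model. It is a sketch, not a description of Fast-RTPS internals: it assumes announcements are sent at a fixed period and that discovery completes on the first announcement that arrives. The function name is hypothetical, and the 40-second period is an assumption chosen only to match the delays observed in the tests above.

```python
def discovery_delay(lost_announcements: int, period_s: float) -> float:
    """Seconds until the first announcement arrives if the first N are lost.

    Simplified model: announcements are periodic and each one is either
    delivered or lost; nothing else delays discovery.
    """
    if lost_announcements < 0 or period_s <= 0:
        raise ValueError("need lost_announcements >= 0 and period_s > 0")
    return lost_announcements * period_s


# ASSUMPTION: 40 s period, picked to match the delays reported above.
for lost in range(3):
    print(f"{lost} lost -> {discovery_delay(lost, 40.0):.0f} s delay")
```

Under this model a single lost announcement already accounts for a 40-second wait, which is consistent with the "1 out of 100" cases listed in the tests.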

Additional remarks

The only multicast traffic on the network was on the participant discovery IP address 239.255.0.1

@codebot
Member

codebot commented Sep 11, 2019

We are seeing quite different results on our network (TP-Link Archer C4000). We just re-ran a few experiments after updating to the latest ROS 2 Dashing packages. All these experiments are with examples_rclpy_minimal_publisher/publisher_old_school running on Machine 1 (RPi3) and examples_rclpy_minimal_subscriber/subscriber_old_school running on Machine 2 (workstation or laptop). There is some background traffic on our network from other developers, but we all have different ROS_DOMAIN_ID values. We ran these experiments with multiple different boxes for "Machine 2" to try to reduce the possibility of strange network interface behavior impacting the results. Each "Machine 2" box was running a fully updated installation of Ubuntu 18.04 with ROS 2 Dashing.

Experiments

Experiment 1 (FastRTPS)

Machine 1: RPi3 connected over WiFi
Machine 2: Workstation connected over Ethernet
Middleware: rmw_fastrtps_cpp
Result (~5 times repeated): typically no messages received. Sometimes after several minutes, a few messages will print, and then it goes "silent" again

Experiment 2 (FastRTPS)

Machine 1: RPi3 connected over WiFi
Machine 2: laptop connected over WiFi
Middleware: rmw_fastrtps_cpp
Result (~5 times repeated): typically no messages received. Once, after several minutes, we saw a few messages print, but then it stopped printing again

Experiment 3 (Cyclone)

Machine 1: RPi3 connected over WiFi
Machine 2: Workstation connected over Ethernet
Middleware: rmw_cyclonedds_cpp
Result (~5 times repeated): as expected, messages start printing within a few seconds and stay printing for the duration of the experiment

Experiment 4 (Cyclone)

Machine 1: RPi3 connected over WiFi
Machine 2: laptop connected over WiFi
Middleware: rmw_cyclonedds_cpp
Result (~5 times repeated): as expected, messages start printing within a few seconds and stay printing for the duration of the experiment

Conclusion

For unknown reasons, FastRTPS does not seem to reliably discover and send messages on our network. However, on the same machines and network infrastructure, CycloneDDS appears to reliably discover and send messages.
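
The comment does not say how the middleware was switched between experiments. One common way in ROS 2 is the RMW_IMPLEMENTATION environment variable, so the sketch below assumes that mechanism. The helper name is hypothetical, and the snippet only builds the command and environment without launching anything.

```python
import os


def ros2_run_with_rmw(rmw, package, executable, base_env=None):
    """Build the command and environment to run a node with a given RMW."""
    env = dict(os.environ if base_env is None else base_env)
    # ASSUMPTION: the middleware is selected through RMW_IMPLEMENTATION.
    env["RMW_IMPLEMENTATION"] = rmw
    cmd = ["ros2", "run", package, executable]
    return cmd, env


cmd, env = ros2_run_with_rmw(
    "rmw_cyclonedds_cpp",
    "examples_rclpy_minimal_subscriber",
    "subscriber_old_school",
)
print(" ".join(cmd), "with RMW_IMPLEMENTATION =", env["RMW_IMPLEMENTATION"])

# To actually launch the node (requires a sourced ROS 2 environment):
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```

Passing "rmw_fastrtps_cpp" instead would correspond to Experiments 1 and 2.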

@MiguelCompany
Contributor Author

@codebot Could you update your comment to share the RPi3 model (as seen at the beginning of the dmesg output), as well as the output from lsb_release -a and uname -a?

@aaronchongth

Hi @MiguelCompany! Here are the requested outputs:

ubuntu@ubuntu:~$ dmesg
[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
[    0.000000] Linux version 4.15.0-1045-raspi2 (buildd@bos02-arm64-003) (gcc version 7.4.0 (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1)) #49-Ubuntu SMP PREEMPT Thu Sep 5 11:27:35 UTC 2019 (Ubuntu 4.15.0-1045.49-raspi2 4.15.18)
[    0.000000] Machine model: Raspberry Pi 3 Model B Rev 1.2
ubuntu@ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.3 LTS
Release:	18.04
Codename:	bionic
ubuntu@ubuntu:~$ uname -a
Linux ubuntu 4.15.0-1045-raspi2 #49-Ubuntu SMP PREEMPT Thu Sep 5 11:27:35 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux

@mabelzhang added the bug label on Sep 12, 2019
@MiguelCompany
Contributor Author

MiguelCompany commented Sep 12, 2019

@aaronchongth Thank you for posting this.

I noticed that I am using a model 3B+, so I took a 3B to repeat the tests with it. I had to upgrade it to match your kernel version, and after rebooting I got error messages from the firmware, which I fixed following the instructions of this post. I am not saying that your problem is related to this; I am just noting it here so other users know of this issue.

I will repeat the tests and keep you posted here.

@MiguelCompany
Contributor Author

MiguelCompany commented Sep 16, 2019

We repeated the tests on Thursday and Friday, using a Raspberry-Pi 3B rev 1.2 with kernel version 4.15.0-1045-raspi2 and had the same results as in my previous comment, so right now we cannot reproduce the problem.

@codebot
Member

codebot commented Sep 16, 2019

It's interesting that the problems don't seem to show up in your testing. Here is another report with what I'm speculating is a similar root cause:
https://blog.roverrobotics.com/navigation2-now-were-getting-somewhere/

If you turn off IGMP Snooping, does discovery have problems? Our router has IGMP Snooping turned off by default.

@MiguelCompany
Contributor Author

If you turn off IGMP Snooping, does discovery have problems? Our router has IGMP Snooping turned off by default.

I tested with IGMP Snooping both on and off, and noticed no difference. The performance of multicast over WiFi also depends on the number of devices connected to the same AP, and is usually bad if different connection kinds (2.4GHz vs 5GHz) are involved. Nevertheless, turning IGMP snooping on generally improves multicast performance and is usually recommended.

Regarding this, I checked the network traffic using Cyclone and Fast-RTPS, and I noticed that Cyclone announces the participant on multicast, but with only unicast locators in its DATA(p). This strategy may improve discovery when only two participants are involved, but it will generate a lot of network traffic if several participants are started at the same time.
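
The traffic concern can be made concrete with a rough counting model. This is a sketch under stated assumptions, not a measurement of either implementation: it assumes every participant must reach every other participant once per round, ignores acknowledgements, retransmissions and endpoint discovery, and uses a hypothetical function name.

```python
from typing import Tuple


def packets_per_round(participants: int) -> Tuple[int, int]:
    """Return (multicast, unicast) packet counts for one round.

    Multicast: each participant sends one packet that reaches all peers.
    Unicast:   each participant sends one packet to each of its peers.
    """
    if participants < 1:
        raise ValueError("need at least one participant")
    multicast = participants
    unicast = participants * (participants - 1)
    return multicast, unicast


for n in (2, 10, 50):
    mc, uc = packets_per_round(n)
    print(f"{n} participants: {mc} multicast vs {uc} unicast packets")
```

With two participants both strategies cost the same in this model; the unicast count grows quadratically as more participants start together.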

@MiguelCompany
Contributor Author

MiguelCompany commented Sep 18, 2019

To reduce multicast traffic, the following DEFAULT_FASTRTPS_PROFILES.xml file can be used on the RPi. It should only be used on one participant, because it makes that participant stop listening on multicast.

<?xml version="1.0" encoding="UTF-8" ?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles" >
    <profiles>

        <!-- This participant profile:
             * Disables all multicast traffic except PDP.
             * Reduces lease duration and announcement.
        -->
        <participant profile_name="test_participant_profile" is_default_profile="true">
            <rtps>
                <builtin>
                    <!-- Specifying an empty metatraffic unicast 
                         locator disables multicast metatraffic
                    -->
                    <metatrafficUnicastLocatorList>
                        <locator/>
                    </metatrafficUnicastLocatorList>
                    
                    <!-- Specifying a multicast initial peer to standard address and port -->
                    <initialPeersList>
                        <locator>
                            <udpv4>
                                <address>239.255.0.1</address>
                                
                                <!-- NOTE: This should be changed to match the port
                                     corresponding to the domain id if we want participants
                                     with the default configuration to discover us.
                                     The corresponding port is 7400 + 250 * domain_id
                                -->
                                <port>7400</port>
                            </udpv4>
                        </locator>
                    </initialPeersList>
                    
                    <leaseDuration>
                        <sec>10</sec>
                        <nanosec>0</nanosec>
                    </leaseDuration>

                    <leaseAnnouncement>
                        <sec>1</sec>
                        <nanosec>0</nanosec>
                    </leaseAnnouncement>
                </builtin>
            </rtps>
        </participant>

    </profiles>
</dds>
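
The NOTE inside the profile gives the port as 7400 + 250 * domain_id. A small helper, with a hypothetical name, can compute the value to put in the port element for a given ROS_DOMAIN_ID. The upper bound check is only the generic UDP port limit, not a limit stated in this thread.

```python
def discovery_multicast_port(domain_id: int) -> int:
    """Port for the <port> element above, using 7400 + 250 * domain_id."""
    if domain_id < 0:
        raise ValueError("domain_id must be non-negative")
    port = 7400 + 250 * domain_id
    if port > 65535:  # generic UDP limit, not something stated in the thread
        raise ValueError(f"domain_id {domain_id} gives port {port} > 65535")
    return port


for domain_id in (0, 1, 42):
    print(f"domain {domain_id}: port {discovery_multicast_port(domain_id)}")
```

For the default domain 0 this yields 7400, which is the value used in the profile above.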

@MiguelCompany
Contributor Author

MiguelCompany commented Sep 26, 2019

Good news: we were able to reproduce the problem and we have a solution.

We bought the same router @codebot used in his report (TP-Link Archer C4000), and with it we did reproduce the issue. Looking at Wireshark captures, we saw that multicast traffic from the RPi-3 was not received by the wired PC. So, while preparing countermeasures to reduce multicast traffic to a minimum and improve discovery timing, we ran many tests on the router's multicast behavior. We found that when a device on the WLAN is both sending and receiving on the same multicast address, a device on the LAN subscribed to that same address does not receive that multicast traffic. The same happens if the second device is connected to a different WLAN band. An easy way to reproduce the problem:

# On LAN connected device
ros2 multicast receive
# On WLAN connected device
ros2 multicast receive &
ros2 multicast send

# Expected result: both devices receive the message
# Actual result: only the WLAN device receives the message
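
On machines without the ros2 CLI, a similar check can be approximated with plain UDP sockets. This is a sketch and not what ros2 multicast does internally: it reuses the 239.255.0.1 group mentioned earlier in this thread, while port 50000 and all function names are hypothetical choices for illustration.

```python
import socket
import sys

GROUP = "239.255.0.1"  # multicast group mentioned earlier in this thread
PORT = 50000           # HYPOTHETICAL test port, not an RTPS port


def membership_request(group, interface="0.0.0.0"):
    """Build the 8-byte ip_mreq value: group address + local interface."""
    return socket.inet_aton(group) + socket.inet_aton(interface)


def receive_one(group=GROUP, port=PORT, timeout=10.0):
    """Join the group and return (data, sender), or None on timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group))
        sock.settimeout(timeout)
        return sock.recvfrom(1024)
    except socket.timeout:
        return None
    finally:
        sock.close()


def send_one(payload=b"hello", group=GROUP, port=PORT):
    """Send a single datagram to the group, limited to the local network."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(payload, (group, port))
    finally:
        sock.close()


# Run as "python3 script.py receive" or "python3 script.py send".
if len(sys.argv) == 2 and sys.argv[1] == "receive":
    print(receive_one())
elif len(sys.argv) == 2 and sys.argv[1] == "send":
    send_one()
```

Mirroring the steps above, the script would run in receive mode on the LAN device and in both receive and send mode on the WLAN device; a None result on the LAN device would match the failure described.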

To address this, we've developed several improvements and features directly on branch 1.8.x of Fast-RTPS, to keep it as compatible as possible with ROS 2 Dashing. The changes break the ABI, so rmw_fastrtps has to be recompiled when testing the patch. Using this repos file, we built and tested on Win10 and Ubuntu PCs and on an RPi-3 with Ubuntu.

@dirk-thomas Would it be possible to perform a patch release of Dashing, changing the version of Fast-RTPS? We (both eProsima and Morgan) think it is really important to address this bug quickly.

@dirk-thomas
Member

The referenced PR updated the Dashing repos file to the provided commit hash of FastRTPS. I also created a note on the project board to do a new release before the next sync: https://github.com/orgs/ros2/projects/12

@clalancette
Contributor

@MiguelCompany Have these changes made it onto the master branch? The code has changed somewhat significantly, but it looks like master is still using the old values: https://github.com/eProsima/Fast-RTPS/blob/9d562024886f4e1e7be363356ab032ecba934490/include/fastrtps/rtps/attributes/RTPSParticipantAttributes.h#L187

@MiguelCompany
Contributor Author

@clalancette We prioritized fixing this for current users, and made the necessary changes on 1.8.x directly. We are getting this into 1.9.x through eProsima/Fast-DDS#744 and will cherry-pick from there to master.

@MiguelCompany
Contributor Author

@clalancette The changes made their way to master in eProsima/Fast-DDS#760. I think this can be closed.

@clalancette
Contributor

Sounds good, thanks. Will close it out.

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros2-default-behavior-wifi/13460/2
