attributes responses to the wrong host after a while (Debian #525431) #12

Closed
xtaran opened this Issue May 27, 2012 · 2 comments

Contributor

xtaran commented May 27, 2012

Initially reported at http://bugs.debian.org/525431 by Peter Folk <support-debian@volo.net>. Quoting his report:

I use fping a lot, to monitor several hundred hosts. I did not notice a problem until recently, but now when I'm pinging a host that is down along with others that are up, after a while fping gets confused about which host is which and ends up saying the wrong host is down. It's unclear whether it's attributing other packets to the wrong hosts, or only the ones that are down.

Case in point:

# 10.7.2.4 is known down, 10.3.1.5 is known up
fping -Al 10.3.1.5 four other random hosts 10.7.2.4
# normal response:
10.3.1.5 : [0], 96 bytes, 114 ms (114 avg, 0% loss)
10.3.1.5 : [1], 96 bytes, 205 ms (159 avg, 0% loss)
10.3.1.5 : [2], 96 bytes, 26.0 ms (115 avg, 0% loss)
ICMP Host Unreachable from 10.7.0.6 for ICMP Echo sent to 10.7.2.4
ICMP Host Unreachable from 10.7.0.6 for ICMP Echo sent to 10.7.2.4
ICMP Host Unreachable from 10.7.0.6 for ICMP Echo sent to 10.7.2.4
# ...
# Then after a while (an hour or so):
10.7.2.4 : [0], 96 bytes, 114 ms (114 avg, 0% loss)
10.7.2.4 : [1], 96 bytes, 205 ms (159 avg, 0% loss)
10.7.2.4 : [2], 96 bytes, 26.0 ms (115 avg, 0% loss)
ICMP Host Unreachable from 10.7.0.6 for ICMP Echo sent to 10.3.1.5
ICMP Host Unreachable from 10.7.0.6 for ICMP Echo sent to 10.3.1.5
ICMP Host Unreachable from 10.7.0.6 for ICMP Echo sent to 10.3.1.5

Using tcpdump, I can verify that the Host Unreachable messages are correct (from 10.7.0.6 for ICMP Echo sent to 10.7.2.4), but fping is reporting them wrong. It doesn't happen with every set of arguments and because it takes some time to reproduce, I haven't been able to find an exact cause (minimal test case). That said, the bug appears to have been introduced recently (say, some time this year).

His report is from 24 Apr 2009 and since there was no upload of fping to Debian in 2009, it may have been introduced by one of the uploads in 2008:

http://packages.qa.debian.org/f/fping/news/20081018T113203Z.html
http://packages.qa.debian.org/f/fping/news/20080303T233206Z.html

The next older upload was from 2006 so if the issue was really introduced by some Debian patches, it's likely that it was a patch introduced in one of those two uploads.

You can find the corresponding Debian source packages at http://snapshot.debian.org/package/fping/

I am seeing this as well. When using fping on a list of 300k hosts, this basically happens for all hosts after a while.

After 50798 hosts, fping starts pinging the hosts from the beginning of the list again and additionally messes up the loss/return statistics.

0.0.0.0 : [0], 96 bytes, 0.10 ms (0.10 avg, 0% loss) [<- 127.0.0.1]
100.0.198.71 : [0], 96 bytes, 86.8 ms (86.8 avg, 0% loss)
100.0.198.79 : [0], 96 bytes, 91.3 ms (91.3 avg, 0% loss)
100.42.208.126 : [0], 96 bytes, 158 ms (158 avg, 0% loss)
... <skipping ~50000 hosts>
188.165.20.134 : [0], 96 bytes, 17.3 ms (17.3 avg, 0% loss)
188.165.20.166 : [0], 96 bytes, 18.1 ms (18.1 avg, 0% loss)
188.165.20.177 : [0], 96 bytes, 15.9 ms (15.9 avg, 0% loss)
188.165.201.77 : [0], 96 bytes, 17.3 ms (17.3 avg, 0% loss)
188.165.202.144 : [0], 96 bytes, 15.6 ms (15.6 avg, 0% loss)
188.165.20.235 : [0], 96 bytes, 13.0 ms (13.0 avg, 0% loss)
188.165.203.141 : [0], 96 bytes, 17.4 ms (17.4 avg, 0% loss)
0.0.0.0 : [0], 96 bytes, 16.3 ms (8.22 avg, 200% return) [<- 188.165.203.148]
100.0.198.71 : [0], 96 bytes, 17.3 ms (52.0 avg, 200% return) [<- 188.165.203.157]
100.0.198.79 : [0], 96 bytes, 17.6 ms (54.5 avg, 200% return) [<- 188.165.20.38]
10.0.0.207 : [0], 96 bytes, 18.0 ms (18.0 avg, 0% loss) [<- 188.165.203.92]
10.0.0.244 : [0], 96 bytes, 17.5 ms (17.5 avg, 0% loss) [<- 188.165.204.138]
10.0.0.4 : [0], 96 bytes, 17.6 ms (17.6 avg, 0% loss) [<- 188.165.204.207]

You can see that fping gets confused about which host it is actually pinging. It seems to ping the correct host, but the "200% return" figures and the "[<- ...]" source addresses, which do not match the host being reported, are clearly wrong.
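
One way symptoms like these can arise is sketched below. It is purely an illustration and not fping's code: it assumes, hypothetically, that a host's position in the list is carried in a 16-bit field and read back from the reply. The names `encode_index` and `attribute_replies` are made up for this sketch, and it does not explain why the wrap happens at 50798 in the run above.

```python
# Hypothetical illustration, not fping's code: a host's list index is
# stored in a 16-bit field and read back unchanged from the reply.
FIELD_BITS = 16
FIELD_MASK = (1 << FIELD_BITS) - 1  # 0xFFFF


def encode_index(host_index):
    """Keep only the bits of the index that fit in the 16-bit field."""
    return host_index & FIELD_MASK


def attribute_replies(num_hosts):
    """Count the replies credited to each host when every host answers once."""
    credited = [0] * num_hosts
    for host_index in range(num_hosts):
        # The receiver only sees the truncated index, so hosts beyond
        # 65535 are credited to entries at the start of the list.
        credited[encode_index(host_index)] += 1
    return credited


credited = attribute_replies(70_000)
# Hosts near the start of the list get two replies for one request sent,
# while hosts past the 16-bit boundary get none.
print(credited[0], credited[4464], credited[65536])
```

Under that assumption, the first entries in the list receive more replies than requests were sent to them, which would show up as a return rate above 100%.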

Owner

schweikert commented Jul 23, 2013

#48 also reported this, and additionally found that it is an integer boundary issue. An experimental fix is in the "seqmap" branch. Please test.
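
For readers unfamiliar with the idea, here is a minimal sketch of a sequence map, assuming the general approach is to hand out 16-bit sequence numbers (the width of the ICMP echo sequence field) and look up the host from a table when a reply arrives. The class `SeqMap`, its methods, and the `ENTRY_LIFETIME` value are assumptions for illustration only; this is not the code in the "seqmap" branch.

```python
SEQ_SPACE = 1 << 16     # ICMP echo sequence numbers are 16 bits wide
ENTRY_LIFETIME = 10.0   # seconds an entry stays valid; assumed for this sketch


class SeqMap:
    """Maps a 16-bit sequence number to (host_index, ping_count)."""

    def __init__(self):
        self._slots = [None] * SEQ_SPACE
        self._next = 0

    def add(self, host_index, ping_count, now):
        """Reserve the next sequence number for this host and ping."""
        seq = self._next
        slot = self._slots[seq]
        if slot is not None and now - slot[2] < ENTRY_LIFETIME:
            # The slot is still in use by a recent ping: refuse to reuse
            # it rather than risk crediting a reply to the wrong host.
            raise RuntimeError("sequence space exhausted")
        self._slots[seq] = (host_index, ping_count, now)
        self._next = (seq + 1) % SEQ_SPACE
        return seq

    def fetch(self, seq, now):
        """Return (host_index, ping_count) for a reply, or None if stale."""
        slot = self._slots[seq]
        if slot is None or now - slot[2] >= ENTRY_LIFETIME:
            return None
        return slot[0], slot[1]


smap = SeqMap()
seq = smap.add(host_index=300_000, ping_count=0, now=0.0)
print(seq, smap.fetch(seq, now=1.0), smap.fetch(seq, now=20.0))
```

The point of the design is that the host index never has to fit into the packet: only the sequence number travels on the wire, and the table holds the full index, so a list of 300k hosts does not overflow the field.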

schweikert added a commit that referenced this issue Jul 23, 2013
