Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arpwatch 100% cpu with libpcap >=1.5.1 #333

Closed
seblu opened this issue Dec 7, 2013 · 9 comments
Closed

arpwatch 100% cpu with libpcap >=1.5.1 #333

seblu opened this issue Dec 7, 2013 · 9 comments

Comments

@seblu
Copy link

seblu commented Dec 7, 2013

I found a regression in libpcap between 1.4.0 and >=1.5.1 which cause arpwatch to consume 100% of CPU and stop working when listening on a bridge interface on i686.

I'm currently not able to reproduce it on another computer (vm).

I tested with fresh 1.5.2 and regression is still present.
I also tested with different linux kernel version 3.9, 3.10, 3.11 and 3.12.
All of this is tested on i686 archlinux host which act as router/firewall.

strace and ltrace give nothing. No output when process is running at 100%. So I guess the process run in loop inside a library call.

# ltrace -Sp $(pidof arpwatch)
^C

The gdb backtrace (see lower) show that the program is trapped in pcap_read_linux_mmap_v3 around ./pcap-linux.c:44xx (1.5.2 tarball)

# gdb arpwatch                                   
GNU gdb (GDB) 7.6.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/arpwatch...(no debugging symbols found)...done.
(gdb) run -d -f /tmp/sex.dat -i brlan
Starting program: /usr/bin/arpwatch -d -f /tmp/sex.dat -i brlan
warning: Could not load shared library symbols for linux-gate.so.1.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
^C
Program received signal SIGINT, Interrupt.
0xb7f85d4d in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap-linux.c:4423
4423    ./pcap-linux.c: No such file or directory.
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0xb7f85d62 in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap-linux.c:4423
4423    in ./pcap-linux.c
(gdb) bt
#0  0xb7f85d62 in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, 
    user=0x0) at ./pcap-linux.c:4423
#1  0xb7f8ac5d in pcap_loop (p=0x818f008, cnt=<optimized out>, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap.c:856
#2  0x08049a32 in main ()
(gdb) https://github.com/the-tcpdump-group/libpcap.git^CQuit
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0xb7f85d46 in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap-linux.c:4413
4413    in ./pcap-linux.c
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0xb7f85e3b in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap-linux.c:4467
4467    in ./pcap-linux.c
(gdb) n
4412    in ./pcap-linux.c
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0xb7f85d35 in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap-linux.c:4412
4412    in ./pcap-linux.c
(gdb) c
ContinuiGNU gdb (GDB) 7.6.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/arpwatch...(no debugging symbols found)...done.
(gdb) run -d -f /tmp/sex.dat -i brlan
Starting program: /usr/bin/arpwatch -d -f /tmp/sex.dat -i brlan
warning: Could not load shared library symbols for linux-gate.so.1.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
^C
Program received signal SIGINT, Interrupt.
0xb7f85d4d in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap-linux.c:4423
4423    ./pcap-linux.c: No such file or directory.
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0xb7f85d62 in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap-linux.c:4423
4423    in ./pcap-linux.c
(gdb) bt
#0  0xb7f85d62 in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, 
    user=0x0) at ./pcap-linux.c:4423
#1  0xb7f8ac5d in pcap_loop (p=0x818f008, cnt=<optimized out>, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap.c:856
#2  0x08049a32 in main ()
(gdb) https://github.com/the-tcpdump-group/libpcap.git^CQuit
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0xb7f85d46 in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap-linux.c:4413
4413    in ./pcap-linux.c
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0xb7f85e3b in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap-linux.c:4467
4467    in ./pcap-linux.c
(gdb) n
4412    in ./pcap-linux.c
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0xb7f85d35 in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap-linux.c:4412
4412    in ./pcap-linux.c
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0xb7f85d57 in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap-linux.c:4423
4423    in ./pcap-linux.c
(gdb) 
ng.
^C
Program received signal SIGINT, Interrupt.
0xb7f85d57 in pcap_read_linux_mmap_v3 (handle=0x818f008, max_packets=0, callback=0x804a3e0 <process_ether>, user=0x0)
    at ./pcap-linux.c:4423
4423    in ./pcap-linux.c
(gdb) 

Do you need more information?

[1] https://projects.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/libpcap

@guyharris
Copy link
Member

This is in the TPACKET_V3 code, which is new in 1.5, and has not been changed much since 1.5.0, so 1.5.2 probably has the same problem.

Are the machines on which you couldn't reproduce this running Linux with a 3.2 or later kernel (and thus with TPACKET_V3 support), or are they either not running Linux or running with an earlier kernel (and thus without TPACKET_V3 support)?

@seblu
Copy link
Author

seblu commented Dec 7, 2013

The machine where I couldn't reproduce is mainly the same, up-to-date archlinux i686 with 3.12 kernel except it's a vm and the network stack. But after more try, I can now reproduce it on a VM.

Recipe:

  • Create bridge,
  • Put your eth iface inside it
  • Run dhcpcd on the brige to get an IP.
  • Run arpwatch on the bridge
  • After few seconds, you get arpwartch in 100% CPU.

The key ingredient is to put an iface in a bridge.

@mcr
Copy link
Member

mcr commented Dec 8, 2013

Sébastien Luttringer notifications@github.com wrote:
> I found a regression in libpcap between 1.4.0 and >=1.5.1 which cause arpwatch
> to consume 100% of CPU and stop working when listening on a bridge interface on
> i686.

> I'm currently not able to reproduce it on another computer (vm).

> I tested with fresh 1.5.2 and regression is still present.
> I also tested with different linux kernel version 3.9, 3.10, 3.11 and 3.12.
> All of this is tested on i686 archlinux host which act as router/firewall.

> strace and ltrace give nothing. No output when process is running at 100%. So I
> guess the process run in loop inside a library call.

So, my guess is that it is cycling through the receive ring without stop, not
detecting when the ring is empty, when it should return into the select()
loop.

I know little about the TPACKET_V3 code, so this is a guess.

Clearly, you need to have an interface on the bridge interface so that at
least one packet arrives. I am unclear if dhclient is critical: can you say?
dhclient uses pcap, likely not TPACKET_v3 though, to read/write traffic.
Maybe different TPACKET_* versions do not mix well.

] Never tell me the odds! | ipv6 mesh networks [
] Michael Richardson, Sandelman Software Works | network architect [
] mcr@sandelman.ca http://www.sandelman.ca/ | ruby on rails [

@guyharris
Copy link
Member

OK, so on my Fedora 16 (3.6.11-4.fc16.x86_64 kernel) virtual machine (VMware Fusion) I did

sudo brctl addbr br0
sudo brctl addif br0 eth0
sudo ifconfig br0 up
sudo ifconfig br0 172.16.135.190 netmask 255.255.255.0
sudo route del default
sudo route add default gw 172.16.135.2

and tried running tcpdump (with both it and the libpcap it uses built from the trunk) and arpwatch (installed from the standard RPM, and probably using the libpcap the OS comes with, which isn't 1.5) on br0, and didn't get the infinite loop.

So I'm not sure what's required to provoke the bug. Perhaps it's kernel-version related.

Debugging this might require running modified versions of libpcap on the machine on which you reproduced it.

@guyharris
Copy link
Member

strace and ltrace give nothing. No output when process is running at 100%.

I.e., you run strace against the arpwatch process, and it prints nothing, meaning it's probably executing no system calls, not even a poll()?

@seblu
Copy link
Author

seblu commented Dec 12, 2013

I mean, when the process reach 100% CPU, poll output stop.

# strace -p $(pidof arpwatch)
Process 14725 attached
poll([{fd=3, events=POLLIN}], 1, 1000)  = 1 ([{fd=3, revents=POLLIN}])
poll([{fd=3, events=POLLIN}], 1, 1000)  = 1 ([{fd=3, revents=POLLIN}])

and the process is running (not in a syscall)

# ps aux|grep arpwa
root     14725 79.9  0.6   7516  3460 ?        R    00:42   0:31 arpwatch -f /tmp/sex.dat -i sex

An important ingredient to trigger the 100% seems to send arp request.
Here, 192.168.242.2 is the ip of the host running the arpwatch, done by another host on the same ethernet network.

# arping -I eth0 192.168.241.2
ARPING 192.168.241.2 from 192.168.241.6 eth0
Unicast reply from 192.168.241.2 [08:00:27:F6:3D:75]  2.143ms

@guyharris
Copy link
Member

Another important ingredient is to run a version of arpwatch that's actually linked with a new version of libpcap. :-)

I was able to reproduce it, and it was a simple error (the negation of max_packets <= 0 is max_packets > 0, not max_packets >= 0).

Arpwatch is what triggered it because arpwatch was passing a count of 0, not -1, to pcap_loop(), and the bug in question meant that a count of 0 was not handled correctly in pcap_loop().

Fixed in 1a52c9a.

@guyharris
Copy link
Member

I've backported 1a52c9a to the 1.5 branch.

@ghost ghost assigned guyharris Dec 14, 2013
@guyharris
Copy link
Member

I was able to reproduce this without any bridging, so the bridging had nothing to do with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

4 participants