Possible regression in select() behaviour between 1.9.0 and 1.9.1 #860

Closed
royhills opened this issue Oct 11, 2019 · 1 comment

@royhills

I believe there may be a regression between libpcap versions 1.9.0 and 1.9.1 when using select() on a file descriptor returned by pcap_get_selectable_fd().

Operating system: Linux (problem first reported on Arch Linux 2019.10.01 x64, reproduced on Debian 10 "Buster" x64)
Compiler: gcc

Summary of problem:

arp-scan (https://github.com/royhills/arp-scan) uses libpcap to receive ARP response packets. Its pcap code worked up to and including libpcap 1.9.0, but fails with libpcap 1.9.1.

No errors are reported, but no ARP response packets are processed. The statistics from pcap_stats() are non-zero, indicating that packets are matching the filter expression.

The key pcap API calls go like this (much simplified):

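    pcap_handle = pcap_create(if_name, errbuf);
    /* ... pcap_set_timeout(), pcap_activate(), etc. ... */
    pcap_fd = pcap_get_selectable_fd(pcap_handle);
    for (;;) {
        FD_ZERO(&readset);
        FD_SET(pcap_fd, &readset);
        n = select(pcap_fd + 1, &readset, NULL, NULL, &to);
        if (n > 0)
            pcap_dispatch(pcap_handle, -1, callback, NULL);
    }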

With libpcap 1.9.1, select() always returns zero, indicating that nothing is ready to read on pcap_fd, so pcap_dispatch() is never called.
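For reference, the select() half of that loop can be written as a small standalone helper. This is a sketch, not arp-scan's actual code: the name wait_readable() is hypothetical, and because it works on any descriptor it can be exercised with a pipe instead of a live pcap_t. A return value of 0 is exactly the symptom described above: the timeout expired with nothing ready to read.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>   /* pipe() and write(), used in the usage example */

/*
 * Hypothetical helper: wait up to timeout_ms milliseconds for fd to
 * become readable. Returns the number of ready descriptors (1 here),
 * 0 on timeout, or -1 on error with errno set.
 */
static int wait_readable(int fd, long timeout_ms)
{
    fd_set readset;
    struct timeval to;

    /* select() may modify both the set and the timeout, so rebuild
       them on every call instead of once outside the loop. */
    FD_ZERO(&readset);
    FD_SET(fd, &readset);
    to.tv_sec = timeout_ms / 1000;
    to.tv_usec = (timeout_ms % 1000) * 1000;

    return select(fd + 1, &readset, NULL, NULL, &to);
}
```

For example, after pipe(fds), wait_readable(fds[0], 10) returns 0 because nothing has been written yet; after write(fds[1], "x", 1) it returns 1.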

The problem is also present on the latest git commit (ac945a4).

Running git bisect between tags/libpcap-1.9.0 and tags/libpcap-1.9.1 gives the following:

rsh@buster:~/libpcap$ git bisect log
git bisect start
# good: [0ff834006347cc131e1256804d4f8d55301b27f3] set version to release
git bisect good 0ff834006347cc131e1256804d4f8d55301b27f3
# bad: [d396f255cf7b96a09cf91d0e8cc94d23777d6986] bump version
git bisect bad d396f255cf7b96a09cf91d0e8cc94d23777d6986
# skip: [9046eead54730fd0b0f9a00ea8fe283fe35b3808] Get rid of extra blank line.
git bisect skip 9046eead54730fd0b0f9a00ea8fe283fe35b3808
# good: [41abf587e7f386be4b67ad3ab720afe55d3579c1] No, you don't use commas there.
git bisect good 41abf587e7f386be4b67ad3ab720afe55d3579c1
# good: [16970a4a176cd262743007bf53aea16321ec060b] Removing null check before free
git bisect good 16970a4a176cd262743007bf53aea16321ec060b
# bad: [2ade7676101366983bd4f86bc039ffd25da8c126] With a timeout of zero, specify a maximum-size retire timeout.
git bisect bad 2ade7676101366983bd4f86bc039ffd25da8c126
# good: [dab5b0ea6f0a2a4024ec5eafa21d12db2c659b0c] Make the 7 used when rounding up a size unsigned.
git bisect good dab5b0ea6f0a2a4024ec5eafa21d12db2c659b0c
# good: [2c615752fe71085a82c93e5cc2e5c758a6195d52] Clean up the code to parse /proc/net/dev.
git bisect good 2c615752fe71085a82c93e5cc2e5c758a6195d52
# good: [7732ed7a5fa325d951df3b8b1f1b121d7890e60a] Just point to the big "kludge" comment in gen_vlan().
git bisect good 7732ed7a5fa325d951df3b8b1f1b121d7890e60a
# good: [5211a3f3aa72ea1b43e2e81e8101c8b269e2f076] Fix missing underscore in pcap man page.
git bisect good 5211a3f3aa72ea1b43e2e81e8101c8b269e2f076
# first bad commit: [2ade7676101366983bd4f86bc039ffd25da8c126] With a timeout of zero, specify a maximum-size retire timeout.

See also the arp-scan issue at: royhills/arp-scan#42

To reproduce:

Run the latest version of arp-scan. A successful run receives and displays the ARP response packets, like this:

rsh@buster:~/arp-scan$ sudo ./arp-scan --localnet
Interface: ens33, type: EN10MB, MAC: 00:0c:29:87:37:f9, IPv4: 192.168.159.155
Starting arp-scan 1.9.5 with 256 hosts (https://github.com/royhills/arp-scan)
192.168.159.1   00:50:56:c0:00:08       VMware, Inc.
192.168.159.2   00:50:56:e6:48:64       VMware, Inc.
192.168.159.182 00:0c:29:8f:1a:6e       VMware, Inc.
192.168.159.254 00:50:56:f5:b4:1d       VMware, Inc.

7 packets received by filter, 0 packets dropped by kernel
Ending arp-scan 1.9.5: 256 hosts scanned in 1.990 seconds (128.64 hosts/sec). 4 responded

A failing run receives packets but displays nothing, like this:

rsh@buster:~/arp-scan$ sudo ./arp-scan --localnet
Interface: ens33, type: EN10MB, MAC: 00:0c:29:87:37:f9, IPv4: 192.168.159.155
Starting arp-scan 1.9.5 with 256 hosts (https://github.com/royhills/arp-scan)

11 packets received by filter, 0 packets dropped by kernel
Ending arp-scan 1.9.5: 256 hosts scanned in 2.012 seconds (127.24 hosts/sec). 0 responded

@guyharris

guyharris commented Oct 11, 2019

A quick look at the code seems to indicate that you're setting the timeout to 0:

$ egrep TO_MS *.[ch]
arp-scan.c:      if ((pcap_set_timeout(pcap_handle, TO_MS)) < 0)
arp-scan.h:#define TO_MS 0                              /* Timeout for pcap_open_live() */

To quote the pcap(3PCAP) man page:

   packet buffer timeout
          If, when capturing, packets are delivered as soon as they
          arrive, the application capturing the packets will be woken up
          for each packet as it arrives, and might have to make one or
          more calls to the operating system to fetch each packet.

          If, instead, packets are not delivered as soon as they arrive,
          but are delivered after a short delay (called a "packet buffer
          timeout"), more than one packet can be accumulated before the
          packets are delivered, so that a single wakeup would be done for
          multiple packets, and each set of calls made to the operating
          system would supply multiple packets, rather than a single
          packet. This reduces the per-packet CPU overhead if packets are
          arriving at a high rate, increasing the number of packets per
          second that can be captured.

          The packet buffer timeout is required so that an application
          won’t wait for the operating system’s capture buffer to fill up
          before packets are delivered; if packets are arriving slowly,
          that wait could take an arbitrarily long period of time.

          Not all platforms support a packet buffer timeout; on platforms
          that don’t, the packet buffer timeout is ignored. A zero value
          for the timeout, on platforms that support a packet buffer
          timeout, will cause a read to wait forever to allow enough
          packets to arrive, with no timeout. A negative value is invalid;
          the result of setting the timeout to a negative value is
          unpredictable.
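To make the zero-timeout case concrete, here is a deliberately simplified, hypothetical model of those semantics. It is not libpcap or kernel code; buffer_delivered() and its parameters are illustrative names, and treating a negative timeout like 0 is an assumption (the man page calls that case unpredictable).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the "packet buffer timeout" rules quoted above.
 *   bytes_buffered - bytes of captured packets waiting in the buffer
 *   buffer_size    - capacity of the buffer
 *   ms_waiting     - time since the first buffered packet arrived
 *   timeout_ms     - the packet buffer timeout (as set with pcap_set_timeout())
 * Returns true if a blocked reader would be woken up.
 */
static bool buffer_delivered(size_t bytes_buffered, size_t buffer_size,
                             long ms_waiting, int timeout_ms)
{
    if (bytes_buffered == 0)
        return false;                  /* nothing to deliver yet */
    if (bytes_buffered >= buffer_size)
        return true;                   /* a full buffer is always delivered */
    if (timeout_ms > 0 && ms_waiting >= timeout_ms)
        return true;                   /* timeout expired with data pending */
    /* timeout_ms == 0 means "wait forever" until the buffer fills;
       negative values are treated the same way in this model. */
    return false;
}
```

In this model, a handful of small ARP replies sitting in a much larger buffer with timeout_ms == 0 is never delivered, however long the reader waits, which matches the arp-scan symptom.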

So, if a program is not using non-blocking mode and select()/poll()/whatever, and it calls pcap_loop(), pcap_dispatch(), pcap_next(), or pcap_next_ex() when only a single packet, or a small number of packets insufficient to fill the kernel's packet buffer, has arrived, the call could wait an indefinitely long period of time before waking up, because it might take an indefinitely long period of time before enough packets arrive to fill the buffer.

I suspect that, if the program is using non-blocking mode and select()/poll()/whatever, it could likewise take an indefinitely long period of time before the FD for the pcap_t is marked as readable and select()/poll()/whatever reports it as such.

Commit 2b16559 was:

Author: Guy Harris <guy@alum.mit.edu>
Date:   Wed Sep 4 20:31:34 2019 -0700

    With a timeout of zero, specify a maximum-size retire timeout.

    A timeout of zero means "wait indefinitely", not "wait for some
    kernel-chosen default block retirement timeout".

That change fixed a bug wherein setting the timeout to 0 did not cause the documented behavior to occur. I can't find the email/GitHub issue/whatever where the problem was reported, but I do remember it having been reported.

I.e., somebody was expecting a zero timeout to cause packets not to be delivered until the buffer fills up, or some such behavior, just as happens on, for example, *BSD and macOS, with BPF as the capture mechanism.
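A hedged sketch of what that commit message implies for Linux with TPACKET_V3: the pcap timeout feeds the block retirement timeout (the tp_retire_blk_tov field of struct tpacket_req3, where 0 asks the kernel to choose a default). The two functions below are hypothetical and based only on the commit message, not copied from libpcap; using UINT_MAX as the "maximum-size" value and the handling of negative timeouts are assumptions.

```c
#include <limits.h>

/*
 * Hypothetical sketch: map a pcap timeout (in ms) to a TPACKET_V3 block
 * retirement timeout. For tp_retire_blk_tov, 0 means "let the kernel
 * pick a default".
 */

/* Before the change: a pcap timeout of 0 was passed straight through,
   so the kernel's default retirement timeout applied and partially
   filled blocks were still handed to user space. */
static unsigned int retire_tov_before(int pcap_timeout_ms)
{
    return pcap_timeout_ms >= 0 ? (unsigned int)pcap_timeout_ms : 0;
}

/* After the change: 0 means "wait indefinitely", approximated here by
   the largest representable retirement timeout. */
static unsigned int retire_tov_after(int pcap_timeout_ms)
{
    if (pcap_timeout_ms > 0)
        return (unsigned int)pcap_timeout_ms;
    if (pcap_timeout_ms == 0)
        return UINT_MAX;
    return 0;  /* negative (invalid): assumed to fall back to the kernel default */
}
```

With arp-scan's TO_MS of 0, the "before" mapping let the kernel retire a partially filled block after its default interval, so select() eventually saw a readable descriptor. Under the "after" mapping, UINT_MAX milliseconds is about 49.7 days, so a block holding only a few ARP replies is effectively never retired.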

On systems with BPF, arp-scan turns immediate mode on directly, with an ioctl; in immediate mode, the timeout is irrelevant, and packets are delivered immediately. (And, at least on macOS, even if I #ifdef that code out, arp-scan is in non-blocking mode and passes a timeout to select() to work around select() not working on BPF devices in some versions of those OSes, so 1) the timeout eventually causes select() to return, and 2) in non-blocking mode, if at least one packet is available, the read returns it immediately.)

So what arp-scan should do is: if pcap_set_immediate_mode() is available, call it (between the pcap_create() and pcap_activate() calls) instead of manually setting immediate mode with BIOCIMMEDIATE or SBIOCSTIME, because:

  1. on systems with BPF, that will cause BIOCIMMEDIATE to be done, and on Solaris, where DLPI is used, it will cause SBIOCSTIME to be done;

  2. on Linux with TPACKET_V3 (i.e., kernels since 3.something), that will cause a fallback to TPACKET_V2, where the timeout doesn't matter, and packets will be delivered immediately.

samba-team-bot pushed a commit to samba-team/samba that referenced this issue Aug 15, 2023
Fix a problem where ctdb_killtcp (almost always) fails to capture
packets with --enable-pcap and libpcap ≥ 1.9.1.  The problem is due to
a gradual change in libpcap semantics when using
pcap_get_selectable_fd(3PCAP) to get a file descriptor and then using
that file descriptor in non-blocking mode.

pcap_set_immediate_mode(3PCAP) says:

  pcap_set_immediate_mode() sets whether immediate mode should be set
  on a capture handle when the handle is activated.  In immediate
  mode, packets are always delivered as soon as they arrive, with no
  buffering.

and

  On Linux, with previous releases of libpcap, capture devices are
  always in immediate mode; however, in 1.5.0 and later, they are, by
  default, not in immediate mode, so if pcap_set_immediate_mode() is
  available, it should be used.

However, it wasn't until libpcap commit
2ade7676101366983bd4f86bc039ffd25da8c126 (before libpcap 1.9.1) that
it became a requirement to use pcap_set_immediate_mode(), even with a
timeout of 0.

More explanation in this libpcap issue comment:

  the-tcpdump-group/libpcap#860 (comment)

Do a configure check for pcap_set_immediate_mode() even though it has
existed for 10 years.  It is easy enough.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=15451

Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>

Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Tue Aug 15 10:53:52 UTC 2023 on atb-devel-224
samba-team-bot pushed a commit to samba-team/samba that referenced this issue Aug 30, 2023
(cherry picked from commit dc7b48c)

Autobuild-User(v4-17-test): Jule Anger <janger@samba.org>
Autobuild-Date(v4-17-test): Tue Aug 29 10:29:56 UTC 2023 on sn-devel-184
samba-team-bot pushed a commit to samba-team/samba that referenced this issue Aug 30, 2023
(cherry picked from commit dc7b48c)

Autobuild-User(v4-18-test): Jule Anger <janger@samba.org>
Autobuild-Date(v4-18-test): Tue Aug 29 12:27:35 UTC 2023 on atb-devel-224
samba-team-bot pushed a commit to samba-team/samba that referenced this issue Aug 30, 2023
(cherry picked from commit dc7b48c)

Autobuild-User(v4-19-test): Jule Anger <janger@samba.org>
Autobuild-Date(v4-19-test): Tue Aug 29 09:34:35 UTC 2023 on atb-devel-224