TPACKET_V3 support broken when read timeout 0. #335

Closed
LordReg opened this issue Dec 8, 2013 · 15 comments
Comments

@LordReg

LordReg commented Dec 8, 2013

When libpcap TPACKET_V3 support is in use, knockd does not register udp packet 'knocks' (haven't tested tcp.) Disabling TPACKET_V3 in libpcap by commenting out the relevant #define in pcap-linux.c (so that TPACKET_V2 support is used) works around the issue.

knockd sets the read timeout to 0 (to wait indefinitely) via pcap_open_live():
https://github.com/jvinet/knock/blob/master/src/knockd.c#L217
If the read timeout is changed to, say, 1000, then knockd registers the udp packets as expected.

Tested libpcap 1.5.1 and git commit 76522d with knockd 0.6 on Linux Kernel 3.12.3.

@guyharris
Member

To quote the pcap(3PCAP) man page:

   read timeout
          If, when capturing,  packets  are  delivered  as  soon  as  they
          arrive,  the  application capturing the packets will be woken up
          for each packet as it arrives, and might have  to  make  one  or
          more calls to the operating system to fetch each packet.

          If,  instead,  packets are not delivered as soon as they arrive,
          but are delivered after a short delay (called a "read timeout"),
          more  than  one packet can be accumulated before the packets are
          delivered, so that a single wakeup would be  done  for  multiple
          packets,  and  each  set  of  calls made to the operating system
          would supply multiple packets,  rather  than  a  single  packet.
          This reduces the per-packet CPU overhead if packets are arriving
          at a high rate, increasing the number of packets per second that
          can be captured.

          The  read  timeout is required so that an application won't wait
          for the operating system's capture  buffer  to  fill  up  before
          packets are delivered; if packets are arriving slowly, that wait
          could take an arbitrarily long period of time.

          Not all platforms support a  read  timeout;  on  platforms  that
          don't,  the read timeout is ignored.  A zero value for the time-
          out, on platforms that support a read timeout, will cause a read
          to wait forever to allow enough packets to arrive, with no time-
          out.

"Enough packets to arrive" can mean "enough packets to fill up a buffer", and, in fact, does mean so on, for example, *BSD, OS X, and Solaris. This is presumably why knockd does not use a timeout of 0 on FreeBSD or OS X; it should also not do so on NetBSD, OpenBSD, DragonFly BSD, or Solaris.

This means that "wait forever" can mean "wait a very very very very long time" if packets are arriving slowly - and if they stop arriving at all, it truly can mean "wait forever".

And, with the advent of libpcap 1.5, which uses TPACKET_V3 (which is more BSD BPF-like than earlier versions of the memory-mapped packet capture code) on Linux, "enough packets to arrive" can mean "enough packets to fill up a buffer" on Linux.

I would strongly suggest that knockd not use a timeout of 0 on any platforms, and pick an appropriate timeout value instead. That also means one less #ifdef....

I've opened a knockd issue for this.

@ghost ghost assigned guyharris Dec 8, 2013
@infrastation
Member

Does the supposedly larger TPACKET_V3 buffer size cause the difference in behaviour (the reporter says it runs OK when pinned to TPACKET_V2)?

@guyharris
Member

With TPACKET_V1 and _V2, the ring buffer consists of a set of per-packet fixed-length slots, each of which is marked as belonging either to the kernel, in which case the kernel can put a packet into it and hand it to the user (if no slots are marked as belonging to the kernel, the packet is dropped), or to userland, in which case it has a packet in it that the userland code can process and, when done, hand the slot back to the kernel. If there are no slots belonging to userland, it can block in a select() call on the FD for the socket, which will wait until there's at least one slot belonging to userland. The kernel knows nothing about the timeout; it is only used in the select() call; it was presumably put in to match the behavior of BSD, wherein a read() from a BPF device with a timeout completes either after the buffer fills up or the timeout expires, even if no packets have been put in the buffer (if there are no packets in the buffer, read() returns 0).

With TPACKET_V3, the ring buffer consists of a set of fixed-length buffer slots, each of which is marked as belonging either to the kernel or userland. As packets arrive, they are put into a buffer slot belonging to the kernel until there's no room for the new packet in the buffer slot, at which point the buffer slot is handed to userland and the packet is put at the beginning of the next buffer slot belonging to the kernel (or dropped if there are no buffer slots in the ring that belong to the kernel). The kernel does know about the timeout and, if there's a timeout and it expires before a buffer slot fills up, the buffer slot is handed to userland. It looks as if a timeout of 0 is turned into a kernel-selected timeout, so it might not block forever, but it might block for a longer period of time than the caller would want.

TPACKET_V3 works similarly to BSD BPF and to Solaris's bufmod STREAMS, in that it delivers blocks full of packets, rather than individual packets, to its clients, by accumulating packets in a block and delivering the block when it's full or after a timer has expired. In BPF, a timeout of zero means "no timeout", so the block is delivered only when it's full, so packets can take an arbitrarily long time to be delivered.
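The block-delivery rule described above can be written down as a small model. This is only a sketch of the BPF-style semantics as stated in this comment, not code from libpcap or any kernel; `struct block_model` and `block_ready` are hypothetical names invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one block that accumulates packets. */
struct block_model {
    size_t capacity;    /* bytes the block can hold */
    size_t used;        /* bytes accumulated so far */
    int    timeout_ms;  /* 0 means "no timeout" (BPF semantics) */
    int    elapsed_ms;  /* time since the block started filling */
};

/* Returns true if the block would be delivered to the reader now:
 * either the next packet does not fit, or a non-zero timeout expired. */
static bool block_ready(const struct block_model *b, size_t next_pkt_len)
{
    bool full = b->used + next_pkt_len > b->capacity;
    bool timed_out = b->timeout_ms > 0 && b->elapsed_ms >= b->timeout_ms;
    return full || timed_out;
}
```

With `timeout_ms` set to 0, `block_ready` stays false for a nearly empty block no matter how large `elapsed_ms` gets, which is the "arbitrarily long time" case.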

@infrastation
Member

I'm beginning to understand now, thank you!

@LordReg
Author

LordReg commented Dec 9, 2013

Even if a timeout of 0 implies a kernel-selected timeout, shouldn't the packets still register? pcap_dispatch is continually polled.

@LordReg
Author

LordReg commented Dec 9, 2013

From further testing, with read timeout 0, pcap_dispatch appears to block indefinitely until a packet is received that passes the set filter, but the return value is 0 and the pcap_handler callback is never called.

@guyharris
Member

If libpcap (or anything else using TPACKET_V3) specifies a timeout of 0, the PF_PACKET socket code picks a default timeout. That timeout is likely to be very low, perhaps as low as 1 ms (the minimum).

Whenever the timeout expires, the kernel code hands a block of packets to userland, even if the block is empty. If it expires again before userland gets a chance to process that block and hand it back to the kernel, it'll hand another block to userland. With enough timer expirations before userland wakes up, it'll have handed the entire ring buffer to userland, so that all subsequent packets are dropped by the PF_PACKET code until userland manages to hand the blocks back to the kernel.

If you explicitly specify a low timeout, that timeout is also used in the poll() call, so the process will wake up fairly soon, from the poll() timeout if nothing else, after blocking, and will clean out whatever blocks have been handed to it.

If you specify a timeout of 0, that turns into a low timeout in the kernel, but there's no timeout in poll(), so the process wakes up only when the PF_PACKET socket code wakes it up. That might not be happening soon enough to avoid this problem.

@guyharris
Member

It appears that PF_PACKET sockets deliver a wakeup when a packet is put in a buffer block or dropped due to no buffer blocks being empty, but not when a buffer block is handed to userland.

This means that if the kernel's timer expires, and there are no packets in the current buffer block being filled by the kernel, that buffer block will be handed to userland, but userland won't be woken up to tell it to consume that block.

Thus, libpcap will consume that block only if either:

  1. a packet is put in a buffer block, meaning it must pass the filter and there must be a current buffer block, belonging to the kernel, into which to put it;
  2. a packet arrives and passes the filter, but there are no current buffer blocks belonging to the kernel, so it's dropped;
  3. the poll() times out.

So, with a low packet acceptance rate (either because there isn't much network traffic or because there is but most of it is rejected by the packet filter), and with a poll() timeout of -1, meaning "block forever", 1) will happen infrequently, and 3) will never happen. With an in-kernel timeout interval significantly shorter than the interval between accepted packets, the timeout will often occur when there are no packets in the current buffer block, in which case the kernel will hand an empty buffer block to userland and not tell userland about it.

If that happens often enough in sequence to cause all buffer blocks to be handed to userland before any wakeups occur, the kernel now has no buffer blocks into which to put packets, and the next time a packet arrives, it will be dropped, and a wakeup will finally occur. libpcap will drain the ring, handing all buffer blocks to the kernel, but it won't have any packets to process!
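The sequence above can be reproduced with a toy model of the ring. This is a simplified simulation of the behaviour described in this comment, not kernel or libpcap code; the ring size and every name in it (`struct ring`, `timer_expired`, `packet_arrived`, `reader_drain`, `run_idle_then_packet`) are assumptions made for illustration.

```c
#include <stdbool.h>

#define RING_BLOCKS 4   /* assumed ring size, for illustration only */

enum owner { OWNER_KERNEL, OWNER_USER };

struct ring {
    enum owner block[RING_BLOCKS];
    int packets[RING_BLOCKS];  /* packets stored in each block */
    int current;               /* block the kernel is filling */
    int delivered;             /* packets the reader actually saw */
    int dropped;               /* packets dropped: no kernel-owned block */
};

static void ring_init(struct ring *r)
{
    for (int i = 0; i < RING_BLOCKS; i++) {
        r->block[i] = OWNER_KERNEL;
        r->packets[i] = 0;
    }
    r->current = 0;
    r->delivered = 0;
    r->dropped = 0;
}

/* Kernel timer fires: the current block is handed to userland even if it
 * is empty, and no wakeup is delivered. */
static void timer_expired(struct ring *r)
{
    if (r->block[r->current] == OWNER_KERNEL) {
        r->block[r->current] = OWNER_USER;
        r->current = (r->current + 1) % RING_BLOCKS;
    }
}

/* A packet passes the filter: it is stored if the kernel owns the current
 * block, otherwise dropped. Either way a wakeup is delivered. */
static bool packet_arrived(struct ring *r)
{
    if (r->block[r->current] == OWNER_KERNEL)
        r->packets[r->current]++;
    else
        r->dropped++;
    return true;
}

/* Reader wakes up: consume every user-owned block and hand it back. */
static void reader_drain(struct ring *r)
{
    for (int i = 0; i < RING_BLOCKS; i++) {
        if (r->block[i] == OWNER_USER) {
            r->delivered += r->packets[i];
            r->packets[i] = 0;
            r->block[i] = OWNER_KERNEL;
        }
    }
}

/* The timer fires `expirations` times with no traffic and no wakeup,
 * then a single packet arrives and wakes the reader. */
static struct ring run_idle_then_packet(int expirations)
{
    struct ring r;
    ring_init(&r);
    for (int i = 0; i < expirations; i++)
        timer_expired(&r);
    if (packet_arrived(&r))
        reader_drain(&r);
    return r;
}
```

Calling `run_idle_then_packet(RING_BLOCKS)` ends with one dropped packet and none delivered: the reader drains the whole ring and finds nothing in it.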

So this is ultimately a problem with the TPACKET_V3 code in the kernel. I personally think that it should not deliver empty buffer blocks to userland, and that it also should not deliver a wakeup when a packet is accepted, and should deliver a wakeup whenever a buffer block is handed to userland. I'll report this to somebody and let them decide which of those changes should be done.

The right workaround in libpcap is to use a small timeout in poll() if 0 is specified as the timeout to libpcap. I'll check that in as a way of making a timeout of 0 work less poorly, and will also change the libpcap documentation to say that the behavior, with a timeout of 0, is platform-dependent, and is therefore definitely not what you want in portable code, and that it's probably not even what you want in non-portable or #ifdef-controlled code.

Programs should, if they want to see packets ASAP, use a timeout of 1 (i.e., 1 ms), not 0 (or use "immediate mode" in libpcap 1.5 and later, but I think there's currently a bug with that when libpcap uses TPACKET_V3 - I'll check that and fix it if necessary - but I think using immediate mode and a timeout of 1 ms will mean it won't be too non-immediate with TPACKET_V3). That might get rid of some #ifdefs, as per, for example, the knockd issue I reported.

@guyharris guyharris reopened this Dec 10, 2013
@guyharris
Member

Workaround checked in as ee40851.

guyharris added a commit that referenced this issue Dec 11, 2013
@mcr
Member

mcr commented Dec 11, 2013

------- Blind-Carbon-Copy

From: Michael Richardson mcr@sandelman.ca
To: tcpdump-workers@lists.tcpdump.org
Subject: Re: [libpcap] TPACKET_V3 support broken when read timeout 0. (#335)
Date: Wed, 11 Dec 2013 08:54:43 -0500

I propose to release a libpcap 1.5.3 this week with the following changes:

  1. if PCAP_LINUX_TPACKET="v3" or ="v2" then the specific method
    will be used, unless:

  2. a new pcap_linux_tpacket(enum method) has been called by the application
    (which, having a GUI, I presume, can let the user pick the right answer).

  3. failing this, the default will be v2 for now.

From the thread on this github issue, my impression is that there might
be a logic error (either a mis-understanding, or a bug) in pcap compared to
the kernel, or it may be that some applications have been doing something
slightly wrong, but up to now, it hasn't mattered.

Having the above controls would let people move forward, and let us gather
enough data to figure out what is up, and fix it.


] Never tell me the odds! | ipv6 mesh networks [
] Michael Richardson, Sandelman Software Works | network architect [
] mcr@sandelman.ca http://www.sandelman.ca/ | ruby on rails [

------- End of Blind-Carbon-Copy

@darkjames

I'd rather suggest environment variable FORCE_TPACKET_VERSION=1/2/3

@guyharris
Member

> From the thread on this github issue, my impression is that there might be a logic error (either a mis-understanding, or a bug) in pcap compared to the kernel, or it may be that some applications have been doing something slightly wrong, but up to now, it hasn't mattered.

There is, in the TPACKET_V3 code in the kernel, something I consider at best a misfeature, namely that the kernel delivers empty blocks to userland but doesn't wake userland up for every empty block passed up. (I'll be asking either linux-netdev or Chetan Loke about this, suggesting that, ideally, it shouldn't deliver empty blocks to userland or, if there's a compelling reason why it should do so, it should at least wake userland up when it does so.)

Before I committed ee40851, that meant that, if you opened a capture device with a timeout value of 0, and TPACKET_V3 was being used, and traffic was arriving sufficiently slower than one packet per millisecond, when a packet arrived, there would be a good chance that all ring-buffer blocks belonged to userland, so the packet would get dropped on the floor (dropping the packet on the floor would wake up userland, which would then return the ring-buffer blocks to the kernel, but by that time it's too late).

The change in ee40851 causes the poll() used to wait for packets to have a timeout of 1 millisecond if the timeout value is 0; this means that userland wakes up much more often, and thus returns the ring-buffer blocks to the kernel much more often.
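As a rough sketch of that idea (not the actual code in ee40851), the mapping from the libpcap read timeout to the `poll()` timeout could look like this. `poll_timeout_for` and `wait_for_ring` are hypothetical names, and treating negative values like 0 is an assumption made here.

```c
#include <poll.h>

/* Hypothetical helper: choose the poll() timeout from the read timeout
 * given to libpcap. A value of 0 no longer becomes -1 ("block forever");
 * it becomes 1 ms so the reader wakes up regularly and returns blocks. */
static int poll_timeout_for(int pcap_timeout_ms)
{
    return pcap_timeout_ms > 0 ? pcap_timeout_ms : 1;
}

/* Wait until the capture socket is readable or the timeout expires.
 * Returns the poll() result: >0 ready, 0 timed out, -1 error. */
static int wait_for_ring(int fd, int pcap_timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    return poll(&pfd, 1, poll_timeout_for(pcap_timeout_ms));
}
```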

So I think this particular bug is fixed to the extent that there should be a lot fewer packet drops in the case where the timeout value is 0.

If a timeout value of 0 is not being used, this bug doesn't show up and, at least in the tests Gabor Tatarka did, with high packet rates, and with dumpcap and tcpdump, neither of which specify a timeout value of 0, there were significantly fewer packet drops with TPACKET_V3 than with TPACKET_V2.

So, with the fix (which I propagated to the 1.5 branch), I don't think this particular problem is reason enough to revert to using TPACKET_V2 by default, and requiring the user or programmer to explicitly request TPACKET_V3.

However, there are two other problems, namely #331 and #333, which are not yet fixed. If you think we should default to TPACKET_V2 until those issues are fixed, that might be reasonable, although I'd prefer not to add an explicit API to override it; eventually, that API should go away, once we have the bug fixed. I'd suggest just having an environment variable, either your suggestion of PCAP_LINUX_TPACKET or Jakub's suggestion of FORCE_TPACKET_VERSION.
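An environment-variable override along those lines could be parsed as in the sketch below. It is purely illustrative: the thread does not settle which variable name or value syntax would be used, so the `"v2"`/`"v3"` values follow the PCAP_LINUX_TPACKET proposal, and `enum ring_choice`, `parse_ring_choice`, and `ring_choice_from_env` are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of the override lookup. */
enum ring_choice { RING_AUTO, RING_FORCE_V2, RING_FORCE_V3 };

/* Map the variable's value to a choice; anything unrecognised, or an
 * unset variable (NULL), leaves the library's default in effect. */
static enum ring_choice parse_ring_choice(const char *value)
{
    if (value == NULL)
        return RING_AUTO;
    if (strcmp(value, "v2") == 0)
        return RING_FORCE_V2;
    if (strcmp(value, "v3") == 0)
        return RING_FORCE_V3;
    return RING_AUTO;
}

/* Read the override from the environment (variable name as proposed
 * in this thread, assumed here for illustration). */
static enum ring_choice ring_choice_from_env(void)
{
    return parse_ring_choice(getenv("PCAP_LINUX_TPACKET"));
}
```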

@mcr
Member

mcr commented Dec 12, 2013

Guy Harris notifications@github.com wrote:
> However, there are two other problems, namely #331 and #333, which are not yet
> fixed. If you think we should default to TPACKET_V2 until those issues are
> fixed, that might be reasonable, although I'd prefer not to add an explicit API
> to override it; eventually, that API should go away, once we have the bug
> fixed. I'd suggest just having an environment variable, either your suggestion
> of PCAP_LINUX_TPACKET or Jakub's suggestion of FORCE_TPACKET_VERSION.

The API could go away, I agree.

I chose PCAP_LINUX specifically because it shows which driver is involved.
My goal is to get more data, and to hedge our bets.

After we fix #331/#333, I suggest we make V3 the default again.
I think that V3 benefits those who are doing higher performance captures, and
they therefore have more incentive to help us debug this :-)


gregnietsky pushed a commit to Distrotech/libpcap that referenced this issue Dec 18, 2013
@guyharris
Member

OK, I think we have fixes for all the outstanding TPACKET_V3 issues, checked into the trunk and the 1.5 branch.

@infrastation
Member

For posterity, there is now a FAQ entry about this.
