TPACKET_V3 support broken when read timeout 0. #335
To quote the pcap(3PCAP) man page:
"Enough packets to arrive" can mean "enough packets to fill up a buffer", and, in fact, does mean so on, for example, *BSD, OS X, and Solaris. This is presumably why knockd does not use a timeout of 0 on FreeBSD or OS X; it should also not do so on NetBSD, OpenBSD, DragonFly BSD, or Solaris.
This means that "wait forever" can mean "wait a very very very very long time" if packets are arriving slowly - and if they stop arriving at all, it truly can mean "wait forever". And, with the advent of libpcap 1.5, which uses TPACKET_V3 (which is more BSD BPF-like than earlier versions of the memory-mapped packet capture code) on Linux, "enough packets to arrive" can mean "enough packets to fill up a buffer" on Linux.
I would strongly suggest that knockd not use a timeout of 0 on any platform, and pick an appropriate timeout value instead. That also means one less
I've opened a knockd issue for this.
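A minimal sketch of that suggestion, assuming the application wants to keep a single code path on every platform. The helper name `effective_read_timeout_ms` and the idea of a caller-supplied fallback are illustrative assumptions, not part of libpcap or knockd; the result would be passed as the `to_ms` argument of `pcap_open_live()`.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a libpcap function): map a requested read
 * timeout to the value actually handed to pcap_open_live(). A request
 * of 0 ("wait forever") is replaced by fallback_ms so the capture
 * never relies on the platform-specific meaning of 0.
 */
int effective_read_timeout_ms(int requested_ms, int fallback_ms)
{
    assert(requested_ms >= 0);
    assert(fallback_ms > 0);
    return requested_ms == 0 ? fallback_ms : requested_ms;
}
```

With this in place, a call such as `effective_read_timeout_ms(0, 1000)` yields 1000, while any explicit non-zero timeout is passed through unchanged.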
Does the supposedly larger TPACKET_V3 buffer size cause the difference in behaviour (the reporter says it runs OK when nailed down to TPACKET_V2)?
With TPACKET_V1 and _V2, the ring buffer consists of a set of per-packet fixed-length slots, each of which is marked as belonging either to the kernel, in which case the kernel can put a packet into it and hand it to the user (if no slots are marked as belonging to the kernel, the packet is dropped), or to userland, in which case it has a packet in it that the userland code can process and, when done, hand the slot back to the kernel. If there are no slots belonging to userland, it can block in a
With TPACKET_V3, the ring buffer consists of a set of fixed-length buffer slots, each of which is marked as belonging either to the kernel or userland. As packets arrive, they are put into a buffer slot belonging to the kernel until there's no room for the new packet in the buffer slot, at which point the buffer slot is handed to userland and the packet is put at the beginning of the next buffer slot belonging to the kernel (or dropped if there are no buffer slots in the ring that belong to the kernel). The kernel does know about the timeout and, if there's a timeout and it expires before a buffer slot fills up, the buffer slot is handed to userland. It looks as if a timeout of 0 is turned into a kernel-selected timeout, so it might not block forever, but it might block for a longer period of time than the caller would want.
TPACKET_V3 works similarly to BSD BPF and to Solaris's bufmod STREAMS, in that it delivers blocks full of packets, rather than individual packets, to its clients, by accumulating packets in a block and delivering the block when it's full or after a timer has expired. In BPF, a timeout of zero means "no timeout", so the block is delivered only when it's full, so packets can take an arbitrarily long time to be delivered.
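The TPACKET_V3 fill-and-hand-off rule described above can be sketched as a toy model. This is only an illustration of the ownership logic: the struct layout, the function names, and the two-block ring are assumptions made for the example and do not mirror the kernel's actual data structures, and the timeout path is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; not the kernel's PF_PACKET structures. */
enum owner { OWNER_KERNEL, OWNER_USER };

#define NBLOCKS 2

struct block {
    enum owner owner;
    size_t used;          /* bytes of packet data accumulated so far */
};

struct ring {
    struct block blocks[NBLOCKS];
    size_t block_size;    /* capacity of each block in bytes */
    size_t cur;           /* index of the block the kernel is filling */
    unsigned drops;       /* packets dropped for lack of a kernel block */
};

void ring_init(struct ring *r, size_t block_size)
{
    assert(block_size > 0);
    for (size_t i = 0; i < NBLOCKS; i++) {
        r->blocks[i].owner = OWNER_KERNEL;
        r->blocks[i].used = 0;
    }
    r->block_size = block_size;
    r->cur = 0;
    r->drops = 0;
}

/* Kernel side: returns 1 if the packet was stored, 0 if it was dropped. */
int ring_add_packet(struct ring *r, size_t len)
{
    struct block *b = &r->blocks[r->cur];

    if (len > r->block_size || b->owner != OWNER_KERNEL) {
        r->drops++;
        return 0;
    }
    if (b->used + len > r->block_size) {
        /* No room: hand the full block to userland, move to the next. */
        b->owner = OWNER_USER;
        r->cur = (r->cur + 1) % NBLOCKS;
        b = &r->blocks[r->cur];
        if (b->owner != OWNER_KERNEL) {
            r->drops++;   /* the next block is still held by userland */
            return 0;
        }
    }
    b->used += len;
    return 1;
}

/* Userland side: give a consumed block back to the kernel. */
void ring_release(struct ring *r, size_t i)
{
    r->blocks[i].owner = OWNER_KERNEL;
    r->blocks[i].used = 0;
}
```

In this model userland sees nothing until a block fills up, which is the BPF-like batching behaviour the comment describes; a real ring also retires blocks when the block timer expires.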
I'm beginning to understand now, thank you!
Even if a timeout of 0 implies a kernel-selected timeout, shouldn't the packets still register? pcap_dispatch is continually polled.
From further testing, with read timeout 0, pcap_dispatch appears to block indefinitely until a packet is received that passes the set filter, but the return value is 0 and the pcap_handler callback is never called.
If libpcap (or anything else using TPACKET_V3) specifies a timeout of 0, the PF_PACKET socket code picks a default timeout. That timeout is likely to be very low, perhaps as low as 1 ms (the minimum).
Whenever the timeout expires, the kernel code hands a block of packets to userland, even if the block is empty. If it expires again before userland gets a chance to process that block and hand it back to the kernel, it'll hand another block to userland. With enough timer expirations before userland wakes up, it'll have handed the entire ring buffer to userland, so that all subsequent packets are dropped by the PF_PACKET code until userland manages to hand the blocks back to the kernel.
If you explicitly specify a low timeout, that timeout is also used in the
If you specify a timeout of 0, that turns into a low timeout in the kernel, but there's no timeout in
It appears that PF_PACKET sockets deliver a wakeup when a packet is put in a buffer block or dropped due to no buffer blocks being empty, but not when a buffer block is handed to userland. This means that if the kernel's timer expires, and there are no packets in the current buffer block being filled by the kernel, that buffer block will be handed to userland, but userland won't be woken up to tell it to consume that block. Thus, libpcap will consume that block only if either:
So, with a low packet acceptance rate (either because there isn't much network traffic or because there is but most of it is rejected by the packet filter), and with a
If that happens often enough in sequence to cause all buffer blocks to be handed to userland before any wakeups occur, the kernel now has no buffer blocks into which to put packets, and the next time a packet arrives, it will be dropped, and a wakeup will finally occur. libpcap will drain the ring, handing all buffer blocks to the kernel, but it won't have any packets to process!
So this is ultimately a problem with the TPACKET_V3 code in the kernel. I personally think that it should not deliver empty buffer blocks to userland, and that it also should not deliver a wakeup when a packet is accepted, and should deliver a wakeup whenever a buffer block is handed to userland. I'll report this to somebody and let them decide which of those changes should be done.
The right workaround in libpcap is to use a small timeout in
Programs should, if they want to see packets ASAP, use a timeout of 1 (i.e., 1 ms), not 0 (or use "immediate mode" in libpcap 1.5 and later, but I think there's currently a bug with that when libpcap uses TPACKET_V3 - I'll check that and fix it if necessary - but I think using immediate mode and a timeout of 1 ms will mean it won't be too non-immediate with TPACKET_V3).
That might get rid of some
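The failure sequence in this comment can be replayed in a small simulation. It is a sketch under the behaviour described in the thread (the timer retires blocks, even empty ones, without a wakeup; a stored or dropped packet triggers a wakeup); the type and function names are invented for the example and this is not kernel or libpcap code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the described behaviour; not kernel code. */
struct v3_sim {
    int nblocks;       /* total blocks in the ring */
    int user_blocks;   /* blocks currently handed to userland */
    int cur_packets;   /* packets in the block the kernel is filling */
    int pending;       /* packets sitting in userland-owned blocks */
    int dropped;       /* packets lost because no block was free */
    bool wakeup;       /* has userland been woken since its last drain? */
};

void sim_init(struct v3_sim *s, int nblocks)
{
    assert(nblocks > 0);
    s->nblocks = nblocks;
    s->user_blocks = 0;
    s->cur_packets = 0;
    s->pending = 0;
    s->dropped = 0;
    s->wakeup = false;
}

/* Block timer fires: retire the current block, even if it is empty.
 * Note that no wakeup is delivered here. */
void sim_timer_expire(struct v3_sim *s)
{
    if (s->user_blocks == s->nblocks)
        return;                    /* nothing left to hand over */
    s->user_blocks++;
    s->pending += s->cur_packets;
    s->cur_packets = 0;
}

/* A packet passes the filter: store it or drop it; either way wake up. */
void sim_packet_arrive(struct v3_sim *s)
{
    if (s->user_blocks == s->nblocks)
        s->dropped++;              /* every block belongs to userland */
    else
        s->cur_packets++;
    s->wakeup = true;
}

/* Userland wakes, consumes all of its blocks and returns them.
 * Returns the number of packets it found. */
int sim_drain(struct v3_sim *s)
{
    int got = s->pending;
    s->pending = 0;
    s->user_blocks = 0;
    s->wakeup = false;
    return got;
}
```

With four blocks, four quiet timer expirations leave every block owned by userland with no wakeup pending; the next packet is dropped, and the drain that follows finds nothing to process, which matches the symptom of `pcap_dispatch` returning 0 without calling the handler.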
Workaround checked in as ee40851.
------- Blind-Carbon-Copy
From: Michael Richardson mcr@sandelman.ca
I propose to release a libpcap 1.5.3 this week with the following changes:
From the thread on this github issue, my impression is that there might
Having the above controls would let people move forward, and let us gather
------- End of Blind-Carbon-Copy
I'd rather suggest environment variable FORCE_TPACKET_VERSION=1/2/3
There is, in the TPACKET_V3 code in the kernel, something I consider at best a misfeature, namely that the kernel delivers empty blocks to userland but doesn't wake userland up for every empty block passed up. (I'll be asking either linux-netdev or Chetan Loke about this, suggesting that, ideally, it shouldn't deliver empty blocks to userland or, if there's a compelling reason why it should do so, it should at least wake userland up when it does so.)
Before I committed ee40851, that meant that, if you opened a capture device with a timeout value of 0, and TPACKET_V3 was being used, and traffic was arriving sufficiently slower than one packet per millisecond, when a packet arrived, there would be a good chance that all ring-buffer blocks belonged to userland, so the packet would get dropped on the floor (dropping the packet on the floor would wake up userland, which would then return the ring-buffer blocks to the kernel, but by that time it's too late). The change in ee40851 causes the
So I think this particular bug is fixed to the extent that there should be a lot fewer packet drops in the case where the timeout value is 0. If a timeout value of 0 is not being used, this bug doesn't show up and, at least in the tests Gabor Tatarka did, with high packet rates, and with dumpcap and tcpdump, neither of which specify a timeout value of 0, there were significantly fewer packet drops with TPACKET_V3 than with TPACKET_V2.
So, with the fix (which I propagated to the 1.5 branch), I don't think this particular problem is reason enough to revert to using TPACKET_V2 by default, and requiring the user or programmer to explicitly request TPACKET_V3.
However, there are two other problems, namely #331 and #333, which are not yet fixed. If you think we should default to TPACKET_V2 until those issues are fixed, that might be reasonable, although I'd prefer not to add an explicit API to override it; eventually, that API should go away, once we have the bug fixed.
I'd suggest just having an environment variable, either your suggestion of PCAP_LINUX_TPACKET or Jakub's suggestion of FORCE_TPACKET_VERSION.
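As a rough illustration of the environment-variable idea, the sketch below parses a requested TPACKET version. The variable names come from the suggestions in this thread, but the accepted values, the fallback rule, and both function names are assumptions made for the example; the thread does not say how (or whether) libpcap implemented such a switch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical parser: accept exactly "1", "2" or "3"; anything else
 * (including NULL for an unset variable) selects the fallback version.
 */
int parse_tpacket_version(const char *value, int fallback)
{
    assert(fallback >= 1 && fallback <= 3);
    if (value == NULL)
        return fallback;
    if (strcmp(value, "1") == 0)
        return 1;
    if (strcmp(value, "2") == 0)
        return 2;
    if (strcmp(value, "3") == 0)
        return 3;
    return fallback;
}

/* Hypothetical wrapper: look the setting up in the environment. */
int tpacket_version_from_env(const char *var_name, int fallback)
{
    return parse_tpacket_version(getenv(var_name), fallback);
}
```

A caller could then write `tpacket_version_from_env("PCAP_LINUX_TPACKET", 3)` or use `FORCE_TPACKET_VERSION` instead, falling back to version 3 when the variable is unset or malformed.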
Guy Harris notifications@github.com wrote:
The API could go away, I agree. I chose PCAP_LINUX specifically because it shows which driver is involved. After we fix #331/#333, I suggest we make V3 the default again.
OK, I think we have fixes for all the outstanding TPACKET_V3 issues, checked into the trunk and the 1.5 branch.
For posterity, there is now a FAQ entry about this.
When libpcap TPACKET_V3 support is in use, knockd does not register udp packet 'knocks' (haven't tested tcp.) Disabling TPACKET_V3 in libpcap by commenting out the relevant #define in pcap-linux.c (so that TPACKET_V2 support is used) works around the issue.
knockd sets the read timeout to 0 (to wait indefinitely) via pcap_open_live():
https://github.com/jvinet/knock/blob/master/src/knockd.c#L217
If the read timeout is changed to, say, 1000, then knockd registers the UDP packets as expected.
Tested libpcap 1.5.1 and git commit 76522d with knockd 0.6 on Linux Kernel 3.12.3.