With a timeout of 0, pcap_next_ex()/pcap_dispatch() can return 0 #350

Closed
msekletar opened this issue Apr 11, 2014 · 11 comments

@msekletar
Contributor

I'd like to know whether this is expected. See the simple reproducer at [1]. Either way, this is a change in behaviour compared to the version of libpcap that used TPACKET_V2.

When running the reproducer as "./repro icmp", the strange thing is that epoll_wait() returns EPOLLIN right away even though we capture from "lo" and nothing is sending ICMP packets. If I understand the kernel code correctly, when there are no packets there shouldn't be any block in the ring dispatched to userspace, so epoll_wait() should not return EPOLLIN.

If you run the reproducer and then launch ping in a separate terminal, you will see output indicating that pcap_next_ex() sometimes returns normally and other times returns 0, indicating a timeout.

[1] https://gist.github.com/msekletar/10465629

@guyharris
Member

I'd certainly expect it on a multi-platform program - BPF's timer, if it works with select()/poll()/kqueues, is started when the select()/poll() is done, and if it expires, a wakeup is delivered even if there are no packets available. So programs should not assume that select()/poll()/epoll_wait()/kqueues/etc. will only say "this descriptor is readable" if there are packets to read; they should assume it's OK to call pcap_dispatch() or pcap_next_ex(), although they should put the pcap_t in non-blocking mode if they're using select()/poll()/epoll_wait()/kqueues/etc. (They should also be aware that some older versions of *BSD/OS X don't support select()/poll()/epoll_wait()/kqueues/etc. well with BPF devices.)

On Linux, this might be a consequence of TPACKET_V3 being more BPF-like, or it might be a consequence of some misfeatures of the kernel code that I discovered when looking at another bug. I'll look into that later.

@guyharris
Member

The other bug, referred to above, was #335.

Here's some stuff taken from that bug (and rewritten to remove details relevant only to that bug, in which a timeout of 0 was specified):

With TPACKET_V3, the ring buffer consists of a set of fixed-length buffer slots, each of which is marked as belonging either to the kernel or userland. As packets arrive, they are put into a buffer slot belonging to the kernel until there's no room for the new packet in the buffer slot, at which point the buffer slot is handed to userland and the packet is put at the beginning of the next buffer slot belonging to the kernel (or dropped if there are no buffer slots in the ring that belong to the kernel). The kernel does know about the timeout and, if there's a timeout and it expires before a buffer slot fills up, the buffer slot is handed to userland, even if the block is empty, which, as noted, is BPF-like behavior.

The timer runs continuously - it's not started only if a process blocks in select()/poll()/epoll_wait() on the socket - so, if it expires again before userland gets a chance to process that block and hand it back to the kernel, it'll hand another block to userland. With enough timer expirations before userland wakes up, it'll have handed the entire ring buffer to userland, so that all subsequent packets are dropped by the PF_PACKET code until userland manages to hand the blocks back to the kernel.

It appears that PF_PACKET sockets deliver a wakeup when a packet is put in a buffer block or dropped due to no buffer blocks being empty, but not when a buffer block is handed to userland.

This means that if the kernel's timer expires, and there are no packets in the current buffer block being filled by the kernel, that buffer block will be handed to userland, but userland won't be woken up to tell it to consume that block.

Thus, libpcap will consume that block only if either:

  1. a packet is put in a buffer block, meaning it must pass the filter and there must be a current buffer block, belonging to the kernel, into which to put it;
  2. a packet arrives and passes the filter, but there are no current buffer blocks belonging to the kernel, so it's dropped;
  3. the poll() inside the read routine for TPACKET_V3 sockets times out.

So, with a low packet acceptance rate (either because there isn't much network traffic or because there is but most of it is rejected by the packet filter), and with a poll() timeout of -1, meaning "block forever", 1) will happen infrequently, and 3) will never happen. With an in-kernel timeout rate significantly higher than the rate of packet acceptance, the timeout will often occur when there are no packets in the current buffer block, in which case the kernel will hand an empty buffer block to userland and not tell userland about it.

If that happens often enough in sequence to cause all buffer blocks to be handed to userland before any wakeups occur, the kernel now has no buffer blocks into which to put packets, and the next time a packet arrives, it will be dropped, and a wakeup will finally occur. libpcap will drain the ring, handing all buffer blocks to the kernel, but it won't have any packets to process!
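The sequence described above can be sketched as a toy model in plain C. This is only an illustration of the behaviour as described in this comment, not kernel or libpcap code: the ring size `RING_BLOCKS`, the `ring_model` structure, and every function name are hypothetical, and the model ignores blocks filling up. It handles only timer expiry, packet arrival, and a userland drain.

```c
#include <stddef.h>

#define RING_BLOCKS 4 /* hypothetical ring size, chosen only for illustration */

enum owner { OWNER_KERNEL, OWNER_USER };

struct ring_model {
    enum owner owner[RING_BLOCKS]; /* who currently owns each block */
    int packets[RING_BLOCKS];      /* packets stored in each block */
    size_t current;                /* block the kernel is filling */
    int wakeups;                   /* wakeups delivered to userland */
    int drops;                     /* packets dropped: no kernel-owned block */
};

static void ring_init(struct ring_model *r)
{
    for (size_t i = 0; i < RING_BLOCKS; i++) {
        r->owner[i] = OWNER_KERNEL;
        r->packets[i] = 0;
    }
    r->current = 0;
    r->wakeups = 0;
    r->drops = 0;
}

/* Timer expiry: hand the current block to userland, even if it is empty,
 * and do NOT deliver a wakeup. */
static void ring_timer_expired(struct ring_model *r)
{
    if (r->owner[r->current] != OWNER_KERNEL)
        return; /* the kernel has no block left to hand over */
    r->owner[r->current] = OWNER_USER;
    r->current = (r->current + 1) % RING_BLOCKS;
}

/* Packet arrival: store the packet if the kernel owns the current block,
 * otherwise drop it; a wakeup is delivered in both cases. */
static void ring_packet_arrived(struct ring_model *r)
{
    if (r->owner[r->current] == OWNER_KERNEL)
        r->packets[r->current]++;
    else
        r->drops++;
    r->wakeups++;
}

/* Userland drain: return every user-owned block to the kernel and report
 * how many packets those blocks contained. */
static int ring_drain(struct ring_model *r)
{
    int found = 0;
    for (size_t i = 0; i < RING_BLOCKS; i++) {
        if (r->owner[i] == OWNER_USER) {
            found += r->packets[i];
            r->packets[i] = 0;
            r->owner[i] = OWNER_KERNEL;
        }
    }
    return found;
}

/* Idle for `expiries` timer expirations, then receive one packet and drain. */
static int packets_seen_after_idle(struct ring_model *r, int expiries)
{
    ring_init(r);
    for (int i = 0; i < expiries; i++)
        ring_timer_expired(r);
    ring_packet_arrived(r);
    return ring_drain(r);
}
```

In this model, `packets_seen_after_idle(&r, RING_BLOCKS)` drops the packet and the drain finds nothing. With a single expiry the packet is stored, but it sits in a block the kernel still owns, so the wakeup again leads to a drain that finds no packets until the timer expires once more.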

So #335 is ultimately a problem with the TPACKET_V3 code in the kernel. I personally think that it should not deliver empty buffer blocks to userland, and that it also should not deliver a wakeup when a packet is accepted, and should deliver a wakeup whenever a buffer block is handed to userland. I'll report this to somebody, at some point, and let them decide which of those changes should be done.

@msekletar
Contributor Author

Hmm... I am not sure I grok this. I read #335 before reporting the issue, but I was thinking that, because 3) should not happen, when EPOLLIN is delivered a subsequent call to pcap_next_ex() would return an error rather than a timeout (assuming no packets are arriving yet), even though returning an error would itself be a change in behaviour.

Is my assumption correct or not?

I'm still trying to wrap my head around the differences between TPACKET_V2 and TPACKET_V3, so please bear with me. Thanks for looking into this!

@guyharris
Member

> because 3) should not happen

Why not? The timer in poll() is independent of the timer in the TPACKET_V3 code (or in BPF on systems using BPF).

> I'd expect that when EPOLLIN is delivered then subsequent call to pcap_next_ex() would return an error

"No packets are available right now" isn't an error; an error is "that network interface went down" or something such as that, indicating that you might want to report something to the user and perhaps stop capturing on that interface. "No packets are available right now" means "keep trying - wait for the next wakeup you get".
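As a rough sketch of that distinction, a capture loop could map each pcap_next_ex() return value to an action as shown here. The constants mirror the documented return values (1 for a packet, 0 for no packets, PCAP_ERROR = -1, PCAP_ERROR_BREAK = -2) but are redefined locally so the fragment compiles without <pcap.h>; the `loop_action` enum and `action_for_result()` are hypothetical names, not part of libpcap.

```c
/* Local mirrors of the documented pcap_next_ex() return values, defined
 * here (as an assumption-free copy of the numbers) so that the sketch
 * does not need <pcap.h>. */
enum {
    NEXT_EX_PACKET    = 1,  /* a packet was read */
    NEXT_EX_NO_PACKET = 0,  /* no packets available right now */
    NEXT_EX_ERROR     = -1, /* PCAP_ERROR */
    NEXT_EX_BREAK     = -2  /* PCAP_ERROR_BREAK */
};

/* Hypothetical actions a capture loop might take. */
enum loop_action {
    ACTION_PROCESS,    /* handle the packet that was returned */
    ACTION_WAIT_AGAIN, /* not an error: wait for the next wakeup */
    ACTION_STOP        /* report the problem or end of capture and stop */
};

static enum loop_action action_for_result(int rc)
{
    switch (rc) {
    case NEXT_EX_PACKET:
        return ACTION_PROCESS;
    case NEXT_EX_NO_PACKET:
        return ACTION_WAIT_AGAIN;
    default:
        /* -1, -2, and any other negative value end the loop. */
        return ACTION_STOP;
    }
}
```

The point of the mapping is that 0 never reaches the error path; only negative values do.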

@msekletar
Contributor Author

> Why not? The timer in poll() is independent of the timer in the TPACKET_V3 code (or in BPF on systems using BPF).

How can I get a timeout when the app either calls poll() itself with the timeout set to -1, or calls pcap_next_ex(), which will block if there is no packet available because, again, poll() is called with the timeout set to -1?

The previous sentence is written from the perspective of an app developer reading the man page, but now we have this hack for TPACKET_V3 with the poll() timeout set to 1 even though blocking indefinitely is requested. When packets are received infrequently, the most common return value from pcap_next_ex() is a timeout, which is unexpected when the fd is supposed to be blocking.

I get that the kernel interface might not be perfect, but the current behaviour breaks applications, which is unacceptable. TBH, until we hash out all the details of TPACKET_V3 support in libpcap, I am tempted to recompile for EL7 and Fedora and fall back to TPACKET_V2 for the time being.

@guyharris
Member

There's more than one timeout being discussed here.

The timeout in libpcap has only one guaranteed effect. On platforms where the capture mechanism, to reduce the overhead of capturing, accumulates batches of packets and delivers the entire batch at once rather than delivering packets individually, the timeout prevents the capture mechanism from waiting indefinitely for the batch buffer to fill up before delivering packets.

It is not guaranteed to time out if the batch is empty (and, in fact, it doesn't do so on SunOS 4.x or, with DLPI, on SunOS 5.x, as I remember), and it is not guaranteed not to time out if the batch is empty (and it does time out if the batch is empty on BPF and on TPACKET_V3). It is not guaranteed to do anything on platforms where there's no batching.

It should not be used for any purposes other than 1) ensuring that you eventually see packets or 2) working around capture mechanism/libpcap bugs.

I know of no real use for a timeout of 0; if network traffic that passes the filter is sufficiently rare, it could take an arbitrarily large amount of time for an application that supplies a timeout of 0 to see any packets, where "arbitrarily large amount of time" could mean "the machine reboots before you see any packets", even on embedded systems that rarely if ever reboot.

@guyharris
Member

About the only thing we can do here is, if the timeout is 0 and either pcap_get_ring_frame() returns NULL or the packet-reading loop returned no packets, have pcap_read_linux_mmap_v3() loop back to the beginning and call pcap_wait_for_frames_mmap() again.
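The control flow being proposed can be sketched roughly as follows. This is not the actual libpcap code: `wait_fn` and `read_fn` are hypothetical stand-ins for the wait and packet-reading steps, `read_with_retry()` is an invented name, and the `scripted_source` helpers exist only to exercise the loop. Details such as pcap_breakloop() handling are left out.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two steps; not libpcap's real internals. */
typedef int (*wait_fn)(void *ctx); /* returns < 0 on error, otherwise 0 */
typedef int (*read_fn)(void *ctx); /* returns packets read, or < 0 on error */

/* With a timeout of 0, keep waiting until at least one packet is read or an
 * error occurs; with a non-zero timeout, return after a single pass even if
 * nothing was read. */
static int read_with_retry(int timeout_ms, wait_fn wait_for_frames,
                           read_fn read_frames, void *ctx)
{
    for (;;) {
        int rc = wait_for_frames(ctx);
        if (rc < 0)
            return rc;
        int n = read_frames(ctx);
        if (n != 0 || timeout_ms != 0)
            return n;
        /* Timeout of 0 and nothing read: loop back and wait again. */
    }
}

/* A scripted source used only to exercise the loop above. */
struct scripted_source {
    const int *results; /* value each successive read attempt returns */
    size_t count;       /* number of scripted results */
    size_t next;        /* index of the next result to return */
    int waits;          /* how many times the wait step ran */
};

static int scripted_wait(void *ctx)
{
    struct scripted_source *s = ctx;
    s->waits++;
    return 0;
}

static int scripted_read(void *ctx)
{
    struct scripted_source *s = ctx;
    if (s->next >= s->count)
        return -1; /* script exhausted: treat as an error */
    return s->results[s->next++];
}
```

With a script of {0, 0, 3}, a timeout of 0 makes `read_with_retry()` wait three times and return 3, whereas a non-zero timeout returns 0 after the first pass.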

guyharris changed the title from "epoll_wait() returns EPOLLIN and next call to pcap_next_ex() might return timeout" to "With a timeout of 0, pcap_next_ex()/pcap_dispatch() can return 0" on Apr 15, 2014

@msekletar
Contributor Author

> About the only thing we can do here is, if the timeout is 0 and either pcap_get_ring_frame() returns NULL or the packet-reading loop returned no packets, have pcap_read_linux_mmap_v3() loop back to the beginning and call pcap_wait_for_frames_mmap() again.

I think that might help.

@guyharris
Member

OK, so that's what I did in 9e35faa.

guyharris added a commit that referenced this issue Apr 16, 2014
If the user specified a timeout value of 0, and we don't have any
packets to return from pcap_read_linux_mmap_v3(), go back and wait for
packets.  Yes, this means waiting indefinitely, but that's what the
documentation used to say, and what we're making it say again; if this
is a problem, *don't use a timeout value of 0*.

This should fix GitHub libpcap issue #350.
@msekletar
Contributor Author

I can verify that libpcap now works as expected. Thanks!

@infrastation
Member

For posterity, there is now a FAQ entry about this.
