pcap_set_fanout function for PACKET_FANOUT support on linux platform #674

xpahos · 2018-02-10T02:58:45Z

For multithreaded applications, it will be useful to add PACKET_FANOUT support.

Tests with n CPU threads per application, without FANOUT (RX_RING only):

3412458 pps
3432391 pps

Tests with n CPU threads per application, with FANOUT and RX_RING:

8813457 pps
8932582 pps

More information about PACKET_FANOUT, with examples: https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt

guyharris · 2018-02-10T03:16:53Z

pcap_set_fanout.3pcap

+.\" WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
+.\" MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
+.\"
+.TH PCAP_SET_FANTOUT 3PCAP "10 February 2018"


"FANOUT", not "FANTOUT".

guyharris · 2018-02-10T03:17:31Z

pcap-linux.c

@@ -6945,6 +6950,26 @@ pcap_set_protocol(pcap_t *p, int protocol)
 	return (0);
 }

+int
+


No need for an extra blank line here.

guyharris · 2018-02-10T03:19:07Z

pcap-linux.c

@@ -206,6 +206,11 @@
 # endif /* PCAP_SUPPORT_PACKET_RING */
 #endif /* PF_PACKET */

+ /* check if kernel supports fanout for socket */
+# ifdef PACKET_FANOUT
+#  define HAVE_FANOUT


As this is a Linux-only file, wouldn't it be sufficient to just test PACKET_FANOUT in the #ifdefs, rather than defining a new symbol?

guyharris · 2018-02-10T03:24:39Z

pcap/pcap.h

@@ -343,6 +343,7 @@ PCAP_API const char *pcap_tstamp_type_val_to_description(int);

 #ifdef __linux__
 PCAP_API int	pcap_set_protocol(pcap_t *, int);
+PCAP_API int    pcap_set_fanout(pcap_t *, int, int);


I think there's one extra space between int and the function name, so that the names don't line up.

guyharris · 2018-02-10T03:26:34Z

pcap_set_fanout.3pcap

+.B pcap_set_protocol()
+is used for forming sockets in fanout groups. Each received
+packet will be scheduled to only one socket from this group.
+More information about scheduling policies could be found in the 


"...can be found in packet(7)", rather than "...could be found in the packet(7)".

guyharris · 2018-02-10T03:29:04Z

pcap_set_fanout.3pcap

+int pcap_set_fanout(pcap_t *p, int flags, int group_id);
+.ft
+.fi
+.SH DESCRIPTION


This section should use the same language that the pcap_set_protocol(3pcap) page does to indicate that 1) this is Linux-specific, and the function isn't even available on other platforms and 2) it only affects network interfaces, not other devices.

guyharris · 2018-02-10T03:35:18Z

pcap-linux.c

@@ -6945,6 +6950,26 @@ pcap_set_protocol(pcap_t *p, int protocol)
 	return (0);
 }

+int
+
+pcap_set_fanout(pcap_t *handle, int flags, int group_id)


flags is actually a type and flags, at least as I read packet_setsockopt() and fanout_add() in af_packet.c. Either it should be called type_flags, or there should be three arguments - type, flags, and group_id - which are combined in the argument to setsockopt().
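As a rough sketch of the three-argument variant, the packing step could look like the following. It assumes the layout described in packet(7), with the group ID in the low 16 bits of the option value and the type and flags in the high 16 bits. The helper name `fanout_pack` and the `EX_` constants are hypothetical; the constants only mirror the values of the corresponding `PACKET_FANOUT_*` macros in `<linux/if_packet.h>`.

```c
#include <stdint.h>

/* Hypothetical constants; the values mirror PACKET_FANOUT_HASH,
 * PACKET_FANOUT_LB, PACKET_FANOUT_FLAG_ROLLOVER, and
 * PACKET_FANOUT_FLAG_DEFRAG from <linux/if_packet.h>. */
enum {
    EX_FANOUT_HASH = 0,
    EX_FANOUT_LB = 1,
    EX_FANOUT_FLAG_ROLLOVER = 0x1000,
    EX_FANOUT_FLAG_DEFRAG = 0x8000
};

/* Combine type, flags, and group ID into the single int that would be
 * passed to setsockopt(): group ID in bits 0-15, type|flags in bits 16-31. */
static int
fanout_pack(uint16_t type, uint16_t flags, uint16_t group_id)
{
    uint32_t type_flags = (uint32_t)(type | flags);

    return (int)((type_flags << 16) | group_id);
}
```

The result would then be handed to the kernel with something like `setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg))`, so callers of a three-argument pcap_set_fanout() would never have to do the shifting themselves.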

guyharris · 2018-02-10T03:37:15Z

pcap_set_fanout.3pcap

+.ft
+.fi
+.SH DESCRIPTION
+.B pcap_set_protocol()


This should indicate what the second and third arguments - or the second, third, and fourth arguments if we go with separate type and flags arguments - do.

guyharris · 2018-02-10T03:43:29Z

So presumably you'd have multiple processes or threads opening the same interface and joining the same fanout group, and different threads/processes using the resulting pcap_ts.

guyharris · 2018-02-10T04:17:03Z

Given that this is Linux-specific, perhaps, instead, we should add pcap_getsockopt(), pcap_setsockopt(), and pcap_ioctl() calls, which are similar to the underlying system calls with the file descriptor argument replaced by a pcap_t * argument. That would allow programs for particular platforms to do platform-specific operations on the underlying BPF device/socket/whatever. (A program could use pcap_fileno() and do the operation on the resulting FD, but this 1) makes it a bit more obvious that you can do those operations and 2) lets libpcap return a "sorry, not supported" error if there is no FD for the pcap_t.)

This would be an "escape hatch" for UN*X similar to the pcap_oid_get_request() and pcap_oid_set_request() calls on Windows.
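A minimal sketch of the "sorry, not supported" behavior, written against a plain file descriptor rather than libpcap internals. The function name `checked_setsockopt` and its error strings are hypothetical; the assumption is that the descriptor comes from pcap_fileno(), which returns -1 when the pcap_t has no descriptor.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical wrapper: fd is assumed to be the value returned by
 * pcap_fileno() for the pcap_t in question. */
static int
checked_setsockopt(int fd, int level, int optname, const void *optval,
    socklen_t optlen, char *errbuf, size_t errbuf_size)
{
    if (fd < 0) {
        /* No descriptor behind this handle: report it, don't guess. */
        snprintf(errbuf, errbuf_size,
            "setsockopt is not supported on this device");
        return -1;
    }
    if (setsockopt(fd, level, optname, optval, optlen) < 0) {
        snprintf(errbuf, errbuf_size, "setsockopt: %s", strerror(errno));
        return -1;
    }
    return 0;
}
```

A real pcap_setsockopt() would take the pcap_t * directly and write into its error buffer; the explicit errbuf parameter here only keeps the sketch independent of libpcap's private headers.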

xpahos · 2018-02-10T12:03:31Z

@guyharris ok, I will close this pull request. For pcap_getsockopt(), pcap_setsockopt(), and pcap_ioctl() I will add wrappers in the next pull request. Something like this:

int pcap_setsockopt(pcap_t *handle, int level, int optname, const void *optval, socklen_t optlen) {
    /* Forward the call to the socket underlying the pcap_t. */
    if (setsockopt(handle->fd, level, optname, optval, optlen) < 0) {
        pcap_fmt_errmsg_for_errno(handle->errbuf, PCAP_ERRBUF_SIZE,
            errno, "can't change settings");
        return -1;
    }
    return 0;
}

Is that good enough, or should I add more detailed error handling in these functions?

guyharris · 2018-02-12T01:42:53Z

Given that there are other capture mechanisms that allow packets from a single source to be distributed amongst multiple readers, another possibility might be to provide a standard API to support all of them.

@luigirizzo: It looks as if netmap can do that, as per your USENIX paper:

4.2.2 Multi-queue interfaces

For cards with multiple ring pairs, file descriptors (and the related ioctl() and poll()) can be configured in one of two modes, chosen through the ring id field in the argument of the NIOCREG ioctl(). In the default mode, the file descriptor controls all rings, causing the kernel to check for available buffers on any of them. In the alternate mode, a file descriptor is associated to a single TX/RX ring pair. This way multiple threads/processes can create separate file descriptors, bind them to different ring pairs, and operate independently on the card without interference or need for synchronization. Binding a thread to a specific core just requires a standard OS system call, setaffinity(), without the need of any new mechanism.

@myri: Can the API for the Myricom cards support multiple threads reading packets from the same interface, and, if so, can separate pcap_ts for the same interface be established for this purpose?

@sfd: Can that be done with DAG cards and the DAG API?

@ntop: Can PF_RING support that?

Note that Linux AF_PACKET sockets support multiple algorithms for distributing packets across sockets, with PACKET_FANOUT_FLAG_ROLLOVER being, arguably, another part of the algorithm (so that the space of algorithms is the Cartesian product of the PACKET_FANOUT_ algorithm and rollover enabled/disabled), and with an option to fan out raw packets or reassembled fragmented IP datagrams. Not all of those would apply to other mechanisms; the core part of the libpcap API enhancement would be to have multiple opens of the same interface somehow bound together as a fanout group.

sfd · 2018-02-12T02:41:24Z

For DAG:
A DAG card supports one or more capture interfaces, and one or more capture streams (rings). By default captured packet records from all interfaces go to one stream, however records can be steered to streams based on interface, flow load-balancing, filters, or any combination of the above.
Currently we expose each stream as a separate libpcap 'device', e.g. dag0:0, dag0:2, dag0:4.

Only one application can attach to each device at once, i.e. they are single-reader. This means multiple applications/workers can read load-balanced packets from a single interface, but they can't read the same packets.

If there was an API in libpcap for supporting multiple load-balanced streams/rings/queues within a single 'device/interface' we could probably support it.

guyharris · 2018-02-12T02:55:11Z

This means multiple applications/workers can read load-balanced packets from a single interface, but they can't read the same packets.

I.e., any given packet will be only seen by one worker - the other workers won't see it? That seems to be the same sort of fanout that PACKET_FANOUT/etc. provide.

sfd · 2018-02-12T03:01:53Z

Correct, a current limitation is that only one 'reader' can attach to a stream at once, so packets are only seen by one worker. This is likely similar to other load-balancing mechanisms, RSS etc.
(Duplication of packet records to multiple streams is supported by DAG in hardware, but it eats up bus bandwidth so only advised under narrow filters.)

vmaffione · 2018-02-12T07:55:44Z

Yes, in netmap it is possible to bind each RX ring (of the same interface) to a different reader thread.
Obviously each reader sees different packets (no duplication), and you need to have a multiqueue interface.
If you want duplication and/or you don't have a multiqueue interface and/or you want to use more readers than RX rings, you can use the lb program (see man lb) after installing netmap.

xpahos · 2018-02-12T15:25:12Z

As I found in the documentation for the Myricom API (https://s3.amazonaws.com/hpp-cspi-sdrive/a0ij0000005xHcdAAE%2FSNFv3_API_Reference_Manual+%282%29.pdf?response-content-disposition=attachment%3Bfilename*%3DUTF-8%27%27SNFv3_API_Reference_Manual%2520%25282%2529.pdf&AWSAccessKeyId=AKIAJCCINNC6VUDONGUA&Expires=2112637986&Signature=eI5JrhAIdYVlTCvHuFtyYY0gS8o%3D), libpcap already has multithreaded support:

5.3.2.2 #define SNF_F_PSHARED 0x1
Device can be process-sharable. This allows multiple independent processes to share rings on the capturing device.
This option can be used to design a custom capture solution but is also used in libpcap when multiple rings are
requested. In this scenario, each libpcap device sees a fraction of the traffic if multiple rings are used unless the
SNF_F_RX_DUPLICATE option is used, in which case each libpcap device sees the same incoming packets.
5.3.2.3 #define SNF_F_RX_DUPLICATE 0x300
Device can duplicate packets to multiple rings as opposed to applying RSS in order to split incoming packets across
rings. Users should be aware that with N rings opened, N times the link bandwidth is necessary to process incoming
packets without drops. The duplication happens in the host rather than the NIC, so while only up to 10Gbits of traffic
crosses the PCIe, N times that bandwidth is necessary on the host.
When duplication is enabled, RSS options are ignored since every packet is delivered to every ring.

It can be configured by using environment variables like SNF_FLAGS, SNF_NUM_RINGS.

guyharris · 2018-04-07T03:43:16Z

By default captured packet records from all interfaces go to one stream, however records can be steered to streams based on interface, flow load-balancing, filters, or any combination of the above.

So can that be done using the DAG API, or do you have to configure the DAG card with the dagconfig command?

sfd · 2018-04-07T06:35:00Z

So can that be done using the DAG API, or do you have to configure the DAG card with the dagconfig command?

It can be done via C API calls. Libpcap could potentially determine that load balancing was already configured, or it could configure it on request. I would be happy to contribute, just need to know where to start.

Stephen

guyharris · 2018-04-07T19:18:39Z

So my initial idea for APIs to control fanout is:

To open the first pcap_t for a group, you do:

pd = pcap_create(name, errbuf);
if (pd == NULL)
    fail;
//
// id is either PCAP_FANOUT_GROUP_NEW or a number specified by the program.
// Linux supports numbers between 0 and 65535 and, in newer kernels, supports
// PCAP_FANOUT_GROUP_NEW, which causes the OS to pick an unused group.
// If your device type doesn't allow a device to belong to multiple groups, such that,
// for example, a given DAG device would have more than one set of streams, perhaps
// 0 and PCAP_FANOUT_GROUP_NEW should be supported, and return an error with
// PCAP_FANOUT_GROUP_NEW if there's already a group of streams in use.
// The goal is to allow programs to support multiple interface types using either 0
// (if they have to run on older Linux systems) or PCAP_FANOUT_GROUP_NEW (if
// they only have to run on Linux systems with PACKET_FANOUT_FLAG_UNIQUEID,
// added in an April 2017 commit).
// 
if (pcap_set_fanout_group(pd, id) == PCAP_ERROR)
    fail;
//
// The policy is one of:
// PCAP_POLICY_DEFAULT
// PCAP_POLICY_HASH - hash based on addresses/ports/etc. in the packet
// PCAP_POLICY_LB - round-robin load-balancing
// PCAP_POLICY_CPU - send to CPU on which the packet arrived
// PCAP_POLICY_ROLLOVER - send to a single stream until it backs up, then
// move on to the next stream (until it backs up, etc.)
// PCAP_POLICY_RND - use a (pseudo-)random number generator
// PCAP_POLICY_QM - "selects the socket using the recorded queue_mapping
// of the received skb", to quote the Linux documentation
//
// with PCAP_POLICY_DEFAULT being supported by all devices
// (picking the default policy if there's more than one, picking
// the *only* policy if there's only one) and the others corresponding
// to the Linux policies in question.  Additional policies can be added
// if any device offers them; no device is required to support anything
// other than PCAP_POLICY_DEFAULT, so code that supports multiple
// device types should use only PCAP_POLICY_DEFAULT.
//
// The policy options are a bitset, with:
//
// PCAP_POLICY_DEFRAG - send all fragments of a fragmented datagram
// to the same stream (corresponds to the Linux option)
//
// PCAP_POLICY_ROLLOVER - fall back on rollover if the stream selected
// by the policy is backlogged
//
// for all policies (none of which need to be supported by a device), and
//
// PCAP_POLICY_HASH_DEFAULT - default hash
// PCAP_POLICY_HASH_IP - include IPv4/v6 addresses in the hash
// PCAP_POLICY_HASH_SRC_PORT - include source TCP/UDP/etc. port
// PCAP_POLICY_HASH_DST_PORT - include destination TCP/UDP/etc. port
// PCAP_POLICY_HASH_GTP_TEID - include GTP TEID
// PCAP_POLICY_HASH_GRE - include stuff from the GRE header
//
// for PCAP_POLICY_HASH, with no requirement to support anything other
// than PCAP_POLICY_HASH_DEFAULT.  (Those options are from the Myricom
// Sniffer documentation.  More can be added as needed for other devices.)
//
// PCAP_POLICY_DEFAULT is 0, and must be supported by all policies on all
// devices.  PCAP_POLICY_HASH_DEFAULT is also 0, so you can pass
// PCAP_POLICY_DEFAULT|PCAP_POLICY_HASH_DEFAULT for PCAP_POLICY_HASH
// and it is guaranteed to be supported if PCAP_POLICY_HASH is supported.
// 
if (pcap_set_fanout_policy(pd, policy, policy_options) == PCAP_ERROR)
    fail;
status = pcap_activate(pd);
if (status indicates failure)
    fail;

//
// Get the group ID assigned, if we used PCAP_FANOUT_GROUP_NEW.
//
group_id = pcap_get_fanout_group(pd);

and then, for all other pcap_ts in the group, do:

pd = pcap_create(name, errbuf);    // same interface name
if (pd == NULL)
    fail;
if (pcap_set_fanout_group(pd, group_id) == PCAP_ERROR)
    fail;
// same policy and options required
if (pcap_set_fanout_policy(pd, policy, policy_options) == PCAP_ERROR)
    fail;
status = pcap_activate(pd);
if (status indicates failure)
    fail;

Another possibility is to have another object of type pcap_fanout_group_t - create that with a given ID or with PCAP_FANOUT_GROUP_NEW, and with a given policy and options, and then have pcap_set_fanout_group() take a pointer to the pcap_fanout_group_t as an argument. On Linux, pcap_activate() would have to set the ID after opening the device and setting its fanout properties, so that other devices join the same group.

So could something such as that be made to work?

For splitting pcap_ts across processes rather than threads, there would have to be a way to get the group ID assigned at activate time, so that the other processes can have their pcap_ts join the same group.
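On Linux, the policy names proposed above would presumably translate into the kernel's fanout types. The following is only a sketch of that translation: the `ex_policy` enum and `ex_linux_fanout_type` are hypothetical names, the numeric return values mirror the `PACKET_FANOUT_*` types in `<linux/if_packet.h>`, and mapping the default policy to hashing is an assumption rather than anything decided in this thread.

```c
/* Hypothetical stand-ins for the proposed PCAP_POLICY_* values. */
enum ex_policy {
    EX_POLICY_DEFAULT,
    EX_POLICY_HASH,
    EX_POLICY_LB,
    EX_POLICY_CPU,
    EX_POLICY_ROLLOVER,
    EX_POLICY_RND,
    EX_POLICY_QM
};

/* Return the Linux fanout type number for a policy, or -1 if the
 * policy has no Linux equivalent ("not supported"). */
static int
ex_linux_fanout_type(enum ex_policy policy)
{
    switch (policy) {
    case EX_POLICY_DEFAULT:     /* assumption: default means hashing */
    case EX_POLICY_HASH:
        return 0;               /* PACKET_FANOUT_HASH */
    case EX_POLICY_LB:
        return 1;               /* PACKET_FANOUT_LB */
    case EX_POLICY_CPU:
        return 2;               /* PACKET_FANOUT_CPU */
    case EX_POLICY_ROLLOVER:
        return 3;               /* PACKET_FANOUT_ROLLOVER */
    case EX_POLICY_RND:
        return 4;               /* PACKET_FANOUT_RND */
    case EX_POLICY_QM:
        return 5;               /* PACKET_FANOUT_QM */
    }
    return -1;
}
```

Other capture modules would supply their own translation and return the "not supported" error for any policy they cannot honor.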

sfd · 2018-04-09T07:22:49Z

I like the idea of having a PCAP_POLICY_DEFAULT which is always implemented, even if the behaviour may vary.
Perhaps there could be a pcap_get_fanout_policy_capabilities() or similar to return a supported bitmap for the device, allowing applications to introspect the capabilities?

Is it necessary for each subsequent pcap_ts to set the same policy, or is it assumed global?

How is the size of the group set? Is it elastic, e.g. starts at 1 and increases as more 'readers' attach, or is it pre-determined? Is there a way to 'get' the maximum number of readers in a group, or only return an error once the maximum supported number is reached?

For a DAG device we would probably only support one group, otherwise we would have to duplicate traffic at additional cost.

Would the functions be stubbed in the plugin interface, e.g. dag_pcap_set_fanout_policy() would be implemented in pcap-dag.c? If so, would pcap_set_fanout_group() be stubbed, or can we use the kernel group id space even when not creating a kernel group?

For multiple processes is it sufficient for one parent process to create the group, and pass the id to the later processes? All processes would still need appropriate permissions/capabilities.

guyharris · 2018-04-26T02:44:55Z

Perhaps there could be a pcap_get_fanout_policy_capabilities() or similar to return a supported bitmap for the device, allowing applications to introspect the capabilities?

Yes. There are the policies, for which we could return a bitmap, which would require either that we limit it to 32 or 64 policies or that we have select()-style bitsets, or we could have a call to ask whether a particular policy is supported (or just have pcap_set_fanout_policy() return a "not supported" error).
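For comparison, a select()-style bitset that is not capped at 32 or 64 policies might look like this sketch. All names (`ex_policy_set`, `EX_POLICY_MAX`, and the helpers) are hypothetical and exist only for illustration; nothing like them is in libpcap today.

```c
#include <limits.h>
#include <string.h>

#define EX_POLICY_MAX 128   /* hypothetical upper bound on policy numbers */
#define EX_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define EX_WORDS ((EX_POLICY_MAX + EX_WORD_BITS - 1) / EX_WORD_BITS)

/* Fixed-size array of words, in the spirit of fd_set. */
typedef struct {
    unsigned long bits[EX_WORDS];
} ex_policy_set;

static void
ex_policy_zero(ex_policy_set *s)
{
    memset(s, 0, sizeof(*s));
}

/* Mark a policy as supported; out-of-range numbers are ignored. */
static void
ex_policy_add(ex_policy_set *s, unsigned policy)
{
    if (policy < EX_POLICY_MAX)
        s->bits[policy / EX_WORD_BITS] |= 1UL << (policy % EX_WORD_BITS);
}

/* Nonzero if the policy is in the set; out-of-range numbers are not. */
static int
ex_policy_isset(const ex_policy_set *s, unsigned policy)
{
    if (policy >= EX_POLICY_MAX)
        return 0;
    return (int)((s->bits[policy / EX_WORD_BITS] >>
        (policy % EX_WORD_BITS)) & 1UL);
}
```

The trade-off is the usual one: a plain integer bitmap is simpler to return from a single call, while a bitset like this leaves room for more policies at the cost of a small helper API.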

For the policy options, we'd have another call, returning a bitset of all supported options.

Is it necessary for each subsequent pcap_ts to set the same policy, or is it assumed global?

We probably should just have it be global - meaning "stored in a per-group data structure", with DAG having only one such structure - so that you don't have to set it for every pcap_t. Attempts to set it on a group that already exists would fail.

How is the size of the group set? Is it elastic, e.g. starts at 1 and increases as more 'readers' attach, or is it pre-determined? Is there a way to 'get' the maximum number of readers in a group, or only return an error once the maximum supported number is reached?

For Linux fanout, I think there's no fixed upper limit, so either one could work. Which would work better for DAG cards?

For a DAG device we would probably only support one group, otherwise we would have to duplicate traffic at additional cost.

So only group 0 would be supported; PCAP_FANOUT_GROUP_NEW would work if group 0 wasn't already created, and would fail with "no more groups allowed" if it's already been created.

Would the functions be stubbed in the plugin interface, e.g. dag_pcap_set_fanout_policy() would be implemented in pcap-dag.c?

Yes.

If so, would pcap_set_fanout_group() be stubbed

Yes. That way, DAG could implement the "fail if the group ID is non-zero or if PCAP_FANOUT_GROUP_NEW is used when group 0 is already in use" policy.

The per-group data structure would have a count of pcap_ts that are using it, so it can be released when the count drops to zero.
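A sketch of that reference-counted structure for a device that supports only group 0, as described for DAG above. The names `ex_group`, `ex_group_join`, `ex_group_leave`, and `EX_GROUP_NEW` (a stand-in for the proposed PCAP_FANOUT_GROUP_NEW) are hypothetical, and the counting is per process only.

```c
#define EX_GROUP_NEW (-1)   /* hypothetical stand-in for PCAP_FANOUT_GROUP_NEW */

/* Per-group state for a device that supports a single group (ID 0). */
struct ex_group {
    unsigned refs;          /* number of handles currently attached */
};

/* Attach a handle to the group. Returns 0 on success, -1 on error. */
static int
ex_group_join(struct ex_group *g, int id)
{
    if (id == EX_GROUP_NEW) {
        if (g->refs > 0)
            return -1;      /* group 0 already exists: no more groups allowed */
    } else if (id != 0) {
        return -1;          /* only group 0 is supported */
    }
    g->refs++;
    return 0;
}

/* Detach a handle; once the count reaches zero the group is free again. */
static void
ex_group_leave(struct ex_group *g)
{
    if (g->refs > 0)
        g->refs--;
}
```

Because the count lives in user space, this covers threads within one process; it does not address the cross-process case discussed in this thread.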

For multiple processes is it sufficient for one parent process to create the group, and pass the id to the later processes? All processes would still need appropriate permissions/capabilities.

That would be ideal, but that might be hard to do if the kernel isn't doing the reference counting of groups, as you can't rely on userland counting, especially if the new processes are created with new programs.

sfd · 2018-04-26T03:45:04Z

Sounds reasonable. If you put the stubs in I should be able to fill them for pcap-dag.

Is it necessary for each subsequent pcap_ts to set the same policy, or is it assumed global?

We probably should just have it be global - meaning "stored in a per-group data structure", with DAG having only one such structure - so that you don't have to set it for every pcap_t. Attempts to set it on a group that already exists would fail.

How is the size of the group set? Is it elastic, e.g. starts at 1 and increases as more 'readers' attach, or is it pre-determined? Is there a way to 'get' the maximum number of readers in a group, or only return an error once the maximum supported number is reached?

For Linux fanout, I think there's no fixed upper limit, so either one could work. Which would work better for DAG cards?

DAG cards currently support up to 32 receive streams, so up to 32-way load balancing. Simply returning an error on exceeding the limit would save one function stub, for the get_limit call.

Note each time the fanout count increases the set of flows going to each existing client changes, which may cause confusion for some stateful applications. Could it be arranged so that you could create/define the full fanout set before activating them?
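To make the concern concrete, here is a small sketch that counts how many flows change reader when the group grows. It assumes the simplest possible distribution rule, reader = hash modulo number of readers, purely for illustration; neither DAG hardware nor Linux is claimed to work exactly this way, and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed rule, for illustration only: reader index = hash % readers. */
static unsigned
ex_reader_for(uint32_t flow_hash, unsigned readers)
{
    return flow_hash % readers;
}

/* Count how many of the flow hashes 0..nflows-1 are assigned to a
 * different reader after the group changes size. */
static unsigned
ex_moved_flows(uint32_t nflows, unsigned old_readers, unsigned new_readers)
{
    unsigned moved = 0;

    for (uint32_t h = 0; h < nflows; h++) {
        if (ex_reader_for(h, old_readers) != ex_reader_for(h, new_readers))
            moved++;
    }
    return moved;
}
```

Under this rule, growing from 2 to 3 readers moves 8 of the 12 flows with hashes 0 through 11, which is why defining the full fanout set before activation would be friendlier to stateful applications.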

So only group 0 would be supported; PCAP_FANOUT_GROUP_NEW would work if group 0 wasn't already created, and would fail with "no more groups allowed" if it's already been created.

That should work.

los93sol · 2018-09-11T02:17:01Z

Is this still active? I’m in a situation where fanout would really be beneficial, and I'm just checking to see what options are available.

los93sol · 2018-09-11T02:18:14Z

Also, is this only for dag cards or does this work with any cards?

guyharris · 2018-09-11T03:26:19Z

Is this still active?

Yes, in the sense that we do want such an API.

is this only for dag cards

Not only is it NOT only for DAG cards, it wasn't even originally proposed for DAG cards, it was proposed for regular Linux network interfaces.

The goal is to come up with something that 1) can be used in code that can work with Linux interfaces, DAG cards, etc. if it uses only capabilities that make sense with all of those sets of adapters and 2) also allows code to use particular capabilities of particular interfaces if it's only going to support those interfaces.

does this work with any cards?

It will only work if the software for the adapters - and, if necessary, the hardware/firmware for the adapters - supports it.

I think all Linux adapters support the Linux fanout API, which is the API that libpcap would use for this. I don't know whether any OSes other than Linux support an API that can do the same sort of thing.

I don't know whether any adapters that don't work as "normal" network interfaces, other than DAG cards, support some form of fanout such as this.

los93sol · 2018-09-11T03:40:16Z

Very cool, thanks for clarifying all my questions. I wish I knew cpp better so I could contribute something to this effort. I’m definitely following this particular feature with great interest.

guyharris · 2018-09-11T03:54:30Z

I wish I knew cpp better so I could contribute something to this effort

C (libpcap is written in C, not C++) isn't the hard part; coming up with a libpcap API that matches what's needed by Linux, the DAG library, and other adapters' APIs is the hard part (for which I haven't had much time lately, unfortunately).

sfd · 2018-09-11T05:12:36Z

I think the FANOUT API is a good idea. I do not think adoption should be delayed to resolve any open questions about DAG support.
We can always make changes in future once libpcap becomes better aware of capture devices with multiple logical interfaces, e.g. when libpcap implements some native pcapng capture APIs.

infrastation · 2018-09-11T11:56:58Z

Would it be better to name the function something like pcap_set_fanout_linux() if the feature is Linux-specific?

sfd · 2018-09-11T20:48:41Z

No, I think it is generically useful. Many capture systems support multiple queues or streams of some form. The proposed implementation covers Linux fine.

The underlying issue is that internally libpcap doesn't have the concept of capture devices which have multiple physical interfaces. I don't think this can be properly solved until libpcap gets 'pcapng' style APIs and capture support. That is a much larger task.

kevinboulain · 2018-10-22T08:36:46Z

pcap-linux.c

+    return 0;
+#else
+	pcap_fmt_errmsg_for_errno(handle->errbuf, PCAP_ERRBUF_SIZE,
+		errno, "funout is not supported");


Probably meant fanout instead of funout.

mcr · 2019-04-26T14:08:35Z

I am closing this pull request; I think that the code needs to be rebased, and the consensus seems to be that a different mechanism should be implemented. Re-open if I'm wrong.

@xpahos