Skip to content

Commit

Permalink
Merge branch 'soreuseport'
Browse files Browse the repository at this point in the history
Tom Herbert says:

====================
This series implements so_reuseport (SO_REUSEPORT socket option) for
TCP and UDP.  For TCP, so_reuseport allows multiple listener sockets
to be bound to the same port.  In the case of UDP, so_reuseport allows
multiple sockets to bind to the same port.  To prevent port hijacking
all sockets bound to the same port using so_reuseport must have the
same uid.  Received packets are distributed to multiple sockets bound
to the same port using a 4-tuple hash.

The motivating case for so_resuseport in TCP would be something like
a web server binding to port 80 running with multiple threads, where
each thread might have it's own listener socket.  This could be done
as an alternative to other models: 1) have one listener thread which
dispatches completed connections to workers. 2) accept on a single
listener socket from multiple threads.  In case #1 the listener thread
can easily become the bottleneck with high connection turn-over rate.
In case #2, the proportion of connections accepted per thread tends
to be uneven under high connection load (assuming simple event loop:
while (1) { accept(); process() }, wakeup does not promote fairness
among the sockets.  We have seen the  disproportion to be as high
as 3:1 ratio between thread accepting most connections and the one
accepting the fewest.  With so_reusport the distribution is
uniform.

The TCP implementation has a problem in that the request sockets for a
listener are attached to a listener socket.  If a SYN is received, a
listener socket is chosen and request structure is created (SYN-RECV
state).  If the subsequent ack in 3WHS does not match the same port
by so_reusport, the connection state is not found (reset) and the
request structure is orphaned.  This scenario would occur when the
number of listener sockets bound to a port changes (new ones are
added, or old ones closed).  We are looking for a solution to this,
maybe allow multiple sockets to share the same request table...

The motivating case for so_reuseport in UDP would be something like a
DNS server.  An alternative would be to recv on the same socket from
multiple threads.  As in the case of TCP, the load across these threads
tends to be disproportionate and we also see a lot of contection on
the socket lock.  Note that SO_REUSEADDR already allows multiple UDP
sockets to bind to the same port, however there is no provision to
prevent hijacking and nothing to distribute packets across all the
sockets sharing the same bound port.  This patch does not change the
semantics of SO_REUSEADDR, but provides usable functionality of it
for unicast.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
  • Loading branch information
davem330 committed Jan 23, 2013
2 parents 4a633a6 + 72289b9 commit c617f39
Show file tree
Hide file tree
Showing 29 changed files with 213 additions and 70 deletions.
2 changes: 1 addition & 1 deletion arch/alpha/include/uapi/asm/socket.h
Expand Up @@ -19,7 +19,7 @@
#define SO_BROADCAST 0x0020
#define SO_LINGER 0x0080
#define SO_OOBINLINE 0x0100
/* To add :#define SO_REUSEPORT 0x0200 */
#define SO_REUSEPORT 0x0200

#define SO_TYPE 0x1008
#define SO_ERROR 0x1007
Expand Down
2 changes: 1 addition & 1 deletion arch/avr32/include/uapi/asm/socket.h
Expand Up @@ -22,7 +22,7 @@
#define SO_PRIORITY 12
#define SO_LINGER 13
#define SO_BSDCOMPAT 14
/* To add :#define SO_REUSEPORT 15 */
#define SO_REUSEPORT 15
#define SO_PASSCRED 16
#define SO_PEERCRED 17
#define SO_RCVLOWAT 18
Expand Down
2 changes: 1 addition & 1 deletion arch/cris/include/uapi/asm/socket.h
Expand Up @@ -24,7 +24,7 @@
#define SO_PRIORITY 12
#define SO_LINGER 13
#define SO_BSDCOMPAT 14
/* To add :#define SO_REUSEPORT 15 */
#define SO_REUSEPORT 15
#define SO_PASSCRED 16
#define SO_PEERCRED 17
#define SO_RCVLOWAT 18
Expand Down
2 changes: 1 addition & 1 deletion arch/frv/include/uapi/asm/socket.h
Expand Up @@ -22,7 +22,7 @@
#define SO_PRIORITY 12
#define SO_LINGER 13
#define SO_BSDCOMPAT 14
/* To add :#define SO_REUSEPORT 15 */
#define SO_REUSEPORT 15
#define SO_PASSCRED 16
#define SO_PEERCRED 17
#define SO_RCVLOWAT 18
Expand Down
2 changes: 1 addition & 1 deletion arch/h8300/include/uapi/asm/socket.h
Expand Up @@ -22,7 +22,7 @@
#define SO_PRIORITY 12
#define SO_LINGER 13
#define SO_BSDCOMPAT 14
/* To add :#define SO_REUSEPORT 15 */
#define SO_REUSEPORT 15
#define SO_PASSCRED 16
#define SO_PEERCRED 17
#define SO_RCVLOWAT 18
Expand Down
2 changes: 1 addition & 1 deletion arch/ia64/include/uapi/asm/socket.h
Expand Up @@ -31,7 +31,7 @@
#define SO_PRIORITY 12
#define SO_LINGER 13
#define SO_BSDCOMPAT 14
/* To add :#define SO_REUSEPORT 15 */
#define SO_REUSEPORT 15
#define SO_PASSCRED 16
#define SO_PEERCRED 17
#define SO_RCVLOWAT 18
Expand Down
2 changes: 1 addition & 1 deletion arch/m32r/include/uapi/asm/socket.h
Expand Up @@ -22,7 +22,7 @@
#define SO_PRIORITY 12
#define SO_LINGER 13
#define SO_BSDCOMPAT 14
/* To add :#define SO_REUSEPORT 15 */
#define SO_REUSEPORT 15
#define SO_PASSCRED 16
#define SO_PEERCRED 17
#define SO_RCVLOWAT 18
Expand Down
3 changes: 1 addition & 2 deletions arch/mips/include/uapi/asm/socket.h
Expand Up @@ -28,8 +28,7 @@
#define SO_LINGER 0x0080 /* Block on close of a reliable
socket to transmit pending data. */
#define SO_OOBINLINE 0x0100 /* Receive out-of-band data in-band. */
#if 0
To add: #define SO_REUSEPORT 0x0200 /* Allow local address and port reuse. */
#define SO_REUSEPORT 0x0200 /* Allow local address and port reuse. */
#endif

#define SO_TYPE 0x1008 /* Compatible name for SO_STYLE. */
Expand Down
2 changes: 1 addition & 1 deletion arch/mn10300/include/uapi/asm/socket.h
Expand Up @@ -22,7 +22,7 @@
#define SO_PRIORITY 12
#define SO_LINGER 13
#define SO_BSDCOMPAT 14
/* To add :#define SO_REUSEPORT 15 */
#define SO_REUSEPORT 15
#define SO_PASSCRED 16
#define SO_PEERCRED 17
#define SO_RCVLOWAT 18
Expand Down
2 changes: 1 addition & 1 deletion arch/parisc/include/uapi/asm/socket.h
Expand Up @@ -13,7 +13,7 @@
#define SO_BROADCAST 0x0020
#define SO_LINGER 0x0080
#define SO_OOBINLINE 0x0100
/* To add :#define SO_REUSEPORT 0x0200 */
#define SO_REUSEPORT 0x0200
#define SO_SNDBUF 0x1001
#define SO_RCVBUF 0x1002
#define SO_SNDBUFFORCE 0x100a
Expand Down
2 changes: 1 addition & 1 deletion arch/powerpc/include/uapi/asm/socket.h
Expand Up @@ -29,7 +29,7 @@
#define SO_PRIORITY 12
#define SO_LINGER 13
#define SO_BSDCOMPAT 14
/* To add :#define SO_REUSEPORT 15 */
#define SO_REUSEPORT 15
#define SO_RCVLOWAT 16
#define SO_SNDLOWAT 17
#define SO_RCVTIMEO 18
Expand Down
2 changes: 1 addition & 1 deletion arch/s390/include/uapi/asm/socket.h
Expand Up @@ -28,7 +28,7 @@
#define SO_PRIORITY 12
#define SO_LINGER 13
#define SO_BSDCOMPAT 14
/* To add :#define SO_REUSEPORT 15 */
#define SO_REUSEPORT 15
#define SO_PASSCRED 16
#define SO_PEERCRED 17
#define SO_RCVLOWAT 18
Expand Down
2 changes: 1 addition & 1 deletion arch/sparc/include/uapi/asm/socket.h
Expand Up @@ -15,7 +15,7 @@
#define SO_PEERCRED 0x0040
#define SO_LINGER 0x0080
#define SO_OOBINLINE 0x0100
/* To add :#define SO_REUSEPORT 0x0200 */
#define SO_REUSEPORT 0x0200
#define SO_BSDCOMPAT 0x0400
#define SO_RCVLOWAT 0x0800
#define SO_SNDLOWAT 0x1000
Expand Down
2 changes: 1 addition & 1 deletion arch/xtensa/include/uapi/asm/socket.h
Expand Up @@ -32,7 +32,7 @@
#define SO_PRIORITY 12
#define SO_LINGER 13
#define SO_BSDCOMPAT 14
/* To add :#define SO_REUSEPORT 15 */
#define SO_REUSEPORT 15
#define SO_PASSCRED 16
#define SO_PEERCRED 17
#define SO_RCVLOWAT 18
Expand Down
6 changes: 6 additions & 0 deletions include/linux/random.h
Expand Up @@ -74,4 +74,10 @@ static inline int arch_get_random_int(unsigned int *v)
}
#endif

/* Pseudo random number generator from numerical recipes. */
static inline u32 next_pseudo_random32(u32 seed)
{
return seed * 1664525 + 1013904223;
}

#endif /* _LINUX_RANDOM_H */
5 changes: 4 additions & 1 deletion include/net/inet6_hashtables.h
Expand Up @@ -71,6 +71,8 @@ extern struct sock *__inet6_lookup_established(struct net *net,

extern struct sock *inet6_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
const struct in6_addr *saddr,
const __be16 sport,
const struct in6_addr *daddr,
const unsigned short hnum,
const int dif);
Expand All @@ -88,7 +90,8 @@ static inline struct sock *__inet6_lookup(struct net *net,
if (sk)
return sk;

return inet6_lookup_listener(net, hashinfo, daddr, hnum, dif);
return inet6_lookup_listener(net, hashinfo, saddr, sport,
daddr, hnum, dif);
}

static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
Expand Down
13 changes: 10 additions & 3 deletions include/net/inet_hashtables.h
Expand Up @@ -81,7 +81,9 @@ struct inet_bind_bucket {
struct net *ib_net;
#endif
unsigned short port;
signed short fastreuse;
signed char fastreuse;
signed char fastreuseport;
kuid_t fastuid;
int num_owners;
struct hlist_node node;
struct hlist_head owners;
Expand Down Expand Up @@ -257,15 +259,19 @@ extern void inet_unhash(struct sock *sk);

extern struct sock *__inet_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
const __be32 saddr,
const __be16 sport,
const __be32 daddr,
const unsigned short hnum,
const int dif);

static inline struct sock *inet_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
__be32 saddr, __be16 sport,
__be32 daddr, __be16 dport, int dif)
{
return __inet_lookup_listener(net, hashinfo, daddr, ntohs(dport), dif);
return __inet_lookup_listener(net, hashinfo, saddr, sport,
daddr, ntohs(dport), dif);
}

/* Socket demux engine toys. */
Expand Down Expand Up @@ -358,7 +364,8 @@ static inline struct sock *__inet_lookup(struct net *net,
struct sock *sk = __inet_lookup_established(net, hashinfo,
saddr, sport, daddr, hnum, dif);

return sk ? : __inet_lookup_listener(net, hashinfo, daddr, hnum, dif);
return sk ? : __inet_lookup_listener(net, hashinfo, saddr, sport,
daddr, hnum, dif);
}

static inline struct sock *inet_lookup(struct net *net,
Expand Down
2 changes: 2 additions & 0 deletions include/net/netfilter/nf_tproxy_core.h
Expand Up @@ -82,6 +82,7 @@ nf_tproxy_get_sock_v4(struct net *net, const u8 protocol,
break;
case NFT_LOOKUP_LISTENER:
sk = inet_lookup_listener(net, &tcp_hashinfo,
saddr, sport,
daddr, dport,
in->ifindex);

Expand Down Expand Up @@ -151,6 +152,7 @@ nf_tproxy_get_sock_v6(struct net *net, const u8 protocol,
break;
case NFT_LOOKUP_LISTENER:
sk = inet6_lookup_listener(net, &tcp_hashinfo,
saddr, sport,
daddr, ntohs(dport),
in->ifindex);

Expand Down
5 changes: 4 additions & 1 deletion include/net/sock.h
Expand Up @@ -140,6 +140,7 @@ typedef __u64 __bitwise __addrpair;
* @skc_family: network address family
* @skc_state: Connection state
* @skc_reuse: %SO_REUSEADDR setting
* @skc_reuseport: %SO_REUSEPORT setting
* @skc_bound_dev_if: bound device index if != 0
* @skc_bind_node: bind hash linkage for various protocol lookup tables
* @skc_portaddr_node: second hash linkage for UDP/UDP-Lite protocol
Expand Down Expand Up @@ -179,7 +180,8 @@ struct sock_common {

unsigned short skc_family;
volatile unsigned char skc_state;
unsigned char skc_reuse;
unsigned char skc_reuse:4;
unsigned char skc_reuseport:4;
int skc_bound_dev_if;
union {
struct hlist_node skc_bind_node;
Expand Down Expand Up @@ -297,6 +299,7 @@ struct sock {
#define sk_family __sk_common.skc_family
#define sk_state __sk_common.skc_state
#define sk_reuse __sk_common.skc_reuse
#define sk_reuseport __sk_common.skc_reuseport
#define sk_bound_dev_if __sk_common.skc_bound_dev_if
#define sk_bind_node __sk_common.skc_bind_node
#define sk_prot __sk_common.skc_prot
Expand Down
3 changes: 1 addition & 2 deletions include/uapi/asm-generic/socket.h
Expand Up @@ -22,8 +22,7 @@
#define SO_PRIORITY 12
#define SO_LINGER 13
#define SO_BSDCOMPAT 14
/* To add :#define SO_REUSEPORT 15 */

#define SO_REUSEPORT 15
#ifndef SO_PASSCRED /* powerpc only differs in these */
#define SO_PASSCRED 16
#define SO_PEERCRED 17
Expand Down
7 changes: 7 additions & 0 deletions net/core/sock.c
Expand Up @@ -665,6 +665,9 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
case SO_REUSEADDR:
sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
break;
case SO_REUSEPORT:
sk->sk_reuseport = valbool;
break;
case SO_TYPE:
case SO_PROTOCOL:
case SO_DOMAIN:
Expand Down Expand Up @@ -972,6 +975,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
v.val = sk->sk_reuse;
break;

case SO_REUSEPORT:
v.val = sk->sk_reuseport;
break;

case SO_KEEPALIVE:
v.val = sock_flag(sk, SOCK_KEEPOPEN);
break;
Expand Down
48 changes: 37 additions & 11 deletions net/ipv4/inet_connection_sock.c
Expand Up @@ -59,6 +59,8 @@ int inet_csk_bind_conflict(const struct sock *sk,
struct sock *sk2;
struct hlist_node *node;
int reuse = sk->sk_reuse;
int reuseport = sk->sk_reuseport;
kuid_t uid = sock_i_uid((struct sock *)sk);

/*
* Unlike other sk lookup places we do not check
Expand All @@ -73,8 +75,11 @@ int inet_csk_bind_conflict(const struct sock *sk,
(!sk->sk_bound_dev_if ||
!sk2->sk_bound_dev_if ||
sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
if (!reuse || !sk2->sk_reuse ||
sk2->sk_state == TCP_LISTEN) {
if ((!reuse || !sk2->sk_reuse ||
sk2->sk_state == TCP_LISTEN) &&
(!reuseport || !sk2->sk_reuseport ||
(sk2->sk_state != TCP_TIME_WAIT &&
!uid_eq(uid, sock_i_uid(sk2))))) {
const __be32 sk2_rcv_saddr = sk_rcv_saddr(sk2);
if (!sk2_rcv_saddr || !sk_rcv_saddr(sk) ||
sk2_rcv_saddr == sk_rcv_saddr(sk))
Expand Down Expand Up @@ -106,6 +111,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
int ret, attempts = 5;
struct net *net = sock_net(sk);
int smallest_size = -1, smallest_rover;
kuid_t uid = sock_i_uid(sk);

local_bh_disable();
if (!snum) {
Expand All @@ -125,9 +131,12 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
spin_lock(&head->lock);
inet_bind_bucket_for_each(tb, node, &head->chain)
if (net_eq(ib_net(tb), net) && tb->port == rover) {
if (tb->fastreuse > 0 &&
sk->sk_reuse &&
sk->sk_state != TCP_LISTEN &&
if (((tb->fastreuse > 0 &&
sk->sk_reuse &&
sk->sk_state != TCP_LISTEN) ||
(tb->fastreuseport > 0 &&
sk->sk_reuseport &&
uid_eq(tb->fastuid, uid))) &&
(tb->num_owners < smallest_size || smallest_size == -1)) {
smallest_size = tb->num_owners;
smallest_rover = rover;
Expand Down Expand Up @@ -185,14 +194,17 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
if (sk->sk_reuse == SK_FORCE_REUSE)
goto success;

if (tb->fastreuse > 0 &&
sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
if (((tb->fastreuse > 0 &&
sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
(tb->fastreuseport > 0 &&
sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
smallest_size == -1) {
goto success;
} else {
ret = 1;
if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
if (((sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
(sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
smallest_size != -1 && --attempts >= 0) {
spin_unlock(&head->lock);
goto again;
Expand All @@ -212,9 +224,23 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
tb->fastreuse = 1;
else
tb->fastreuse = 0;
} else if (tb->fastreuse &&
(!sk->sk_reuse || sk->sk_state == TCP_LISTEN))
tb->fastreuse = 0;
if (sk->sk_reuseport) {
tb->fastreuseport = 1;
tb->fastuid = uid;
} else {
tb->fastreuseport = 0;
tb->fastuid = 0;
}
} else {
if (tb->fastreuse &&
(!sk->sk_reuse || sk->sk_state == TCP_LISTEN))
tb->fastreuse = 0;
if (tb->fastreuseport &&
(!sk->sk_reuseport || !uid_eq(tb->fastuid, uid))) {
tb->fastreuseport = 0;
tb->fastuid = 0;
}
}
success:
if (!inet_csk(sk)->icsk_bind_hash)
inet_bind_hash(sk, tb, snum);
Expand Down

0 comments on commit c617f39

Please sign in to comment.