📝 docs(1.0): improve the Linux articles on XDP, tools, and kernel network parameters
Signed-off-by: Tony Deng <wolf.deng@gmail.com>
tonydeng committed Nov 30, 2017
1 parent 6c92215 commit 296bee2
Showing 10 changed files with 623 additions and 2 deletions.
4 changes: 2 additions & 2 deletions SUMMARY.md
@@ -27,13 +27,13 @@
- [eBPF](linux/bpf/README.md)
- [bcc](linux/bpf/bcc.md)
- [Troubleshooting](linux/bpf/troubleshooting.md)
- [XDP](linux/XDP/index.md)
- [XDP](linux/XDP/README.md)
- [XDP Architecture](linux/XDP/design.md)
- [Use Cases](linux/XDP/use-cases.md)
- [Common Tools](linux/tools.md)
- [Packet Capture with tcpdump](linux/tcpdump.md)
- [scapy](linux/scapy.md)
- [Kernel Network Parameters](linux/params.md)
- [Kernel Network Parameters](linux/kernel-network-params.md)
- [4. Open vSwitch](ovs/README.md)
- [OVS Introduction](ovs/README.md)
- [Building OVS](ovs/build.md)
49 changes: 49 additions & 0 deletions linux/XDP/README.md
@@ -0,0 +1,49 @@
# XDP

XDP (eXpress Data Path) provides a high-performance, programmable network data path in the Linux kernel. Because packets are processed before they ever enter the network protocol stack, XDP brings a large performance gain to Linux networking, in the same class as DPDK.

![xdp-packet-processing](images/xdp-packet-processing-1024x560.png)

XDP's main features include:

- Processing before the network protocol stack
- Lock-free design
- Batched I/O operations
- Polling
- Direct queue access
- No skbuff allocation required
- NIC offload support
- DDIO
- XDP programs run to completion quickly, with no loops
- Packet steering

## Comparison with DPDK

Compared with DPDK, XDP has the following advantages:

- No third-party libraries or licenses required
- Supports both polling and interrupt-driven networking
- No hugepage allocation required
- No dedicated CPUs required
- No new security model to define

## Examples

- [Linux kernel BPF samples](https://github.com/torvalds/linux/tree/master/samples/bpf)
- [prototype-kernel samples](https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/samples/bpf)
- [libbpf](https://github.com/torvalds/linux/tree/master/tools/lib/bpf)
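To experiment with the samples above, a compiled XDP object can be attached to an interface with iproute2. A minimal sketch, assuming a hypothetical object file `xdp_prog.o` with its program in an ELF section named `xdp`, and a hypothetical interface `eth0` (requires root and iproute2 with XDP support):

```sh
# Attach the program in section "xdp" of xdp_prog.o to eth0
ip link set dev eth0 xdp obj xdp_prog.o sec xdp

# Verify the attachment ("xdp" appears in the link output)
ip link show dev eth0

# Detach the program again
ip link set dev eth0 xdp off
```

If the NIC driver lacks native XDP support, `xdpgeneric` can be used in place of `xdp` to force (slower) generic-mode processing.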

## Drawbacks

Note that XDP's performance gains come at a cost: generality and fairness are sacrificed.

- XDP provides no queueing (qdisc); if the TX device is too slow, packets are dropped outright, so do not use XDP on devices where RX is faster than TX
- XDP programs are special-purpose and lack the generality of the network protocol stack

## References

- [Introduction to XDP](https://www.iovisor.org/technology/xdp)
- [Network Performance BoF](http://people.netfilter.org/hawk/presentations/NetDev1.1_2016/links.html)
- [XDP Introduction and Use-cases](http://people.netfilter.org/hawk/presentations/xdp2016/xdp_intro_and_use_cases_sep2016.pdf)
- [Linux Network Stack](http://people.netfilter.org/hawk/presentations/theCamp2016/theCamp2016_next_steps_for_linux.pdf)
- [NetDev 1.2 video](https://www.youtube.com/watch?v=NlMQ0i09HMU&feature=youtu.be&t=3m3s)
22 changes: 22 additions & 0 deletions linux/XDP/design.md
@@ -0,0 +1,22 @@
# XDP Architecture

XDP builds on a series of techniques to achieve high performance and programmability:

- Based on eBPF
- Capabilities negotiation: the features supported by the NIC driver are negotiated; XDP uses new features where available, but drivers do not need to support all of them
- Processing before the network protocol stack
- Lock-free design
- Batched I/O operations
- Polling
- Direct queue access
- No skbuff allocation required
- NIC offload support
- DDIO
- XDP programs run to completion quickly, with no loops
- Packet steering

## Packet processing logic

As shown below, packets are processed by an in-kernel eBPF program; each RX queue is assigned its own CPU, and each packet gets a dedicated page (packet-page), avoiding skbuff allocation.

![xdp-packet-processor](images/packet-processor.png)
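The one-CPU-per-RX-queue model above can be observed on a multi-queue NIC. A sketch, assuming a hypothetical interface `eth0`:

```sh
# Number of RX/combined queues the NIC exposes
ethtool -l eth0

# IRQs serving each queue; their CPU affinity is under /proc/irq/<n>/smp_affinity
grep eth0 /proc/interrupts
```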
Binary file added linux/XDP/images/packet-processor.png
10 changes: 10 additions & 0 deletions linux/XDP/use-cases.md
@@ -0,0 +1,10 @@
# XDP Use Cases

XDP's use cases include:

- DDoS mitigation
- Firewalls
- Load balancing based on `XDP_TX`
- Network statistics
- Sophisticated packet sampling
- High-speed trading platforms
201 changes: 201 additions & 0 deletions linux/kernel-network-params.md
@@ -0,0 +1,201 @@
# Kernel network parameters

## nf_conntrack

`nf_conntrack` is the Linux kernel's connection-tracking module, commonly used with `iptables`, for example:

```
-A INPUT -m state --state RELATED,ESTABLISHED -j RETURN
-A INPUT -m state --state INVALID -j DROP
```

The connections currently being tracked can be listed with `cat /proc/net/nf_conntrack`. They are kept in memory in a hash table (using chaining to resolve collisions), and each entry occupies roughly 300 bytes.
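At roughly 300 bytes per entry, the table's worst-case memory footprint is easy to estimate. A sketch for the default table size:

```sh
# Memory used by a full conntrack table, assuming ~300 bytes per entry
entries=262144                        # default nf_conntrack_max
bytes=$((entries * 300))
echo "$((bytes / 1024 / 1024)) MiB"   # about 75 MiB
```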

`nf_conntrack` has three related kernel parameters:

- `nf_conntrack_max`: size of the connection-tracking table. A recommended sizing based on memory is `CONNTRACK_MAX = RAMSIZE (in bytes) / 16384 / (x / 32)`, where `x` is the pointer width in bits (32 or 64), subject to `nf_conntrack_max = 4 * nf_conntrack_buckets`. Default: 262144
- `nf_conntrack_buckets`: number of hash buckets (`nf_conntrack_max / nf_conntrack_buckets` is the average chain length per bucket). Default: 65536
- `nf_conntrack_tcp_timeout_established`: timeout for established TCP sessions. Default: 432000 seconds (5 days)
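The sizing formula can be worked through in shell. A sketch for a 64 GB, 64-bit machine:

```sh
# CONNTRACK_MAX = RAMSIZE(bytes) / 16384 / (x / 32), x = pointer width in bits
ram_bytes=$((64 * 1024 * 1024 * 1024))   # 64 GB of RAM
x=64                                     # 64-bit kernel
max=$((ram_bytes / 16384 / (x / 32)))
buckets=$((max / 4))                     # nf_conntrack_max = 4 * nf_conntrack_buckets
echo "nf_conntrack_max=$max nf_conntrack_buckets=$buckets"
```

This yields `nf_conntrack_max=2097152` and `nf_conntrack_buckets=524288`; the 64 GB recommendation below doubles both for extra headroom.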

For example, for a machine with 64 GB of RAM, the recommended configuration is:

```
net.netfilter.nf_conntrack_max=4194304
net.netfilter.nf_conntrack_tcp_timeout_established=300
net.netfilter.nf_conntrack_buckets=1048576
```

## bridge-nf

bridge-nf allows netfilter to filter IPv4/ARP/IPv6 packets traversing a Linux bridge. For example, once `net.bridge.bridge-nf-call-iptables=1` is set, packets forwarded by an L2 bridge are also filtered by the iptables FORWARD rules, which can result in L3 iptables rules filtering L2 frames (see [here](https://bugzilla.redhat.com/show_bug.cgi?id=512206)).

Common options include:

- `net.bridge.bridge-nf-call-arptables`: whether the bridge's ARP packets are filtered in the `arptables` FORWARD chain
- `net.bridge.bridge-nf-call-ip6tables`: whether IPv6 packets are filtered in the `ip6tables` chains
- `net.bridge.bridge-nf-call-iptables`: whether IPv4 packets are filtered in the `iptables` chains
- `net.bridge.bridge-nf-filter-vlan-tagged`: whether VLAN-tagged packets are filtered in `iptables`/`arptables`

These can also be set via `/sys/devices/virtual/net/<bridge-name>/bridge/nf_call_iptables`, but note that the kernel applies the larger of the two values.
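The sysfs knob can be read and written per bridge. A sketch with a hypothetical bridge `br0` (requires root):

```sh
# Disable iptables filtering for traffic bridged through br0 only
echo 0 > /sys/devices/virtual/net/br0/bridge/nf_call_iptables

# Confirm the setting (remember: the larger of this value and
# net.bridge.bridge-nf-call-iptables wins)
cat /sys/devices/virtual/net/br0/bridge/nf_call_iptables
```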

Sometimes bridge-nf should be disabled only for certain bridges while staying enabled for the rest (for example, CNI network plugins usually require `bridge-nf-call-iptables` to be enabled, yet you may want to exempt a particular bridge). In that case, use iptables instead:

```sh
iptables -t raw -I PREROUTING -i <bridge-name> -j NOTRACK
```

## Reverse path filtering

Reverse path filtering prevents a packet that arrived on one interface from being answered out a different interface (sometimes called "asymmetric routing"). It is best to enable it unless asymmetric routing is genuinely required, because it stops users on the subnet from spoofing IP addresses and reduces the opportunity for DDoS (distributed denial of service) attacks.

Reverse path filtering is enabled via the rp_filter option, e.g. `sysctl -w net.ipv4.conf.default.rp_filter=INTEGER`, which supports three values:

- 0: no source validation.
- 1: strict mode, as defined in RFC 3704.
- 2: loose mode, as defined in RFC 3704.

Per-interface overrides can be set via `net.ipv4.conf.<interface>.rp_filter`.
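A sketch of a common setup: strict mode by default, loosened on one asymmetrically routed interface (`eth1` is a hypothetical name; requires root):

```sh
# Strict reverse-path validation by default
sysctl -w net.ipv4.conf.all.rp_filter=1
sysctl -w net.ipv4.conf.default.rp_filter=1

# Loose mode on an asymmetrically routed interface; the effective value
# is the maximum of conf/all and conf/<interface>, so 2 wins here
sysctl -w net.ipv4.conf.eth1.rp_filter=2
```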

## TCP parameters

**Parameter** | **Description** | **Default** | **Tuned value**
------- | ------- | -------- | --------
`net.core.rmem_default` | Default socket receive buffer size (bytes). | 212992 |
`net.core.rmem_max` | Maximum socket receive buffer size (bytes). | 212992 |
`net.core.wmem_default` | Default socket send buffer size (bytes). | 212992 |
`net.core.wmem_max` | Maximum socket send buffer size (bytes). | 212992 |
`net.core.netdev_max_backlog` | Maximum number of packets queued per interface when packets arrive faster than the kernel can process them. | 1000 | 10000
`net.core.somaxconn` | System-wide limit on the listen backlog of each port. | 128 | 2048
`net.core.optmem_max` | Maximum ancillary buffer size allowed per socket. | 20480 | 81920
`net.ipv4.tcp_mem` | How the TCP stack reacts to memory usage; each value is in memory pages (usually 4 KB). The first is the low watermark; above the second, memory-pressure mode starts squeezing buffer usage; the third is the hard limit, above which packets may be dropped to reduce memory use. Increase these for larger BDPs (note: pages, not bytes). | 5814 7754 11628 |
`net.ipv4.tcp_rmem` | Auto-tuning of socket receive memory: the minimum bytes allocated per receive buffer; the default (overridden by `rmem_default`), up to which the buffer grows when the system is not under load; and the maximum (overridden by `rmem_max`). | 4096 87380 3970528 |
`net.ipv4.tcp_wmem` | Auto-tuning of socket send memory: the minimum bytes allocated per send buffer; the default (overridden by `wmem_default`), up to which the buffer grows when the system is not under load; and the maximum (overridden by `wmem_max`). | 4096 16384 3970528 |
`net.ipv4.tcp_keepalive_time` | Interval (seconds) between TCP keepalive probes used to verify that a connection is still alive. | 7200 | 1800
`net.ipv4.tcp_keepalive_intvl` | Interval (seconds) between retransmitted keepalive probes when no response is received. | 75 | 30
`net.ipv4.tcp_keepalive_probes` | Number of unanswered keepalive probes before the connection is considered dead. | 9 | 3
`net.ipv4.tcp_sack` | Enable selective acknowledgments (1 = enabled): acknowledging out-of-order segments lets the sender retransmit only the missing ones. Should be enabled (especially over WANs), at some extra CPU cost. | 1 | 1
`net.ipv4.tcp_fack` | Enable forward acknowledgment, which works with SACK to reduce congestion; should also be enabled. | 1 | 1
`net.ipv4.tcp_timestamps` | TCP timestamps (adds 12 bytes to the TCP header); enable RTT measurement more accurate than retransmission timeouts (see RFC 1323). Should be enabled for better performance. | 1 | 1
`net.ipv4.tcp_window_scaling` | Enable RFC 1323 window scaling (1 = enabled); required for TCP windows larger than 64 KB, up to 1 GB. Takes effect only when both ends of the connection enable it. | 1 | 1
`net.ipv4.tcp_syncookies` | Whether SYN cookies are enabled (the kernel must be built with `CONFIG_SYN_COOKIES`); protects a socket from overload when too many connection attempts arrive. | 1 | 1
`net.ipv4.tcp_tw_reuse` | Whether sockets (ports) in TIME-WAIT state may be reused for new TCP connections. | 0 | 1
`net.ipv4.tcp_tw_recycle` | Faster recycling of TIME-WAIT sockets. Breaks connections from clients behind NAT and was removed in Linux 4.12; enable with caution. | 0 | 1
`net.ipv4.tcp_fin_timeout` | How long (seconds) a locally closed socket stays in FIN-WAIT-2, in case the peer closes, never finishes the connection, or dies unexpectedly. | 60 | 30
`net.ipv4.ip_local_port_range` | Local port range available to TCP/UDP. | 32768 60999 | 1024 65000
`net.ipv4.tcp_max_syn_backlog` | Maximum number of queued connection requests not yet acknowledged by the peer; try increasing it if the server is frequently overloaded. | 128 |
`net.ipv4.tcp_low_latency` | Make the TCP/IP stack favor low latency over high throughput; should be disabled. | 0 | 0
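The buffer maxima in `tcp_rmem`/`tcp_wmem` should be at least the bandwidth-delay product (BDP) of the path, or the window can never fill the pipe. A sketch for a 1 Gbit/s link with 20 ms RTT:

```sh
# BDP = bandwidth (bytes/s) * RTT (s)
bw_bits_per_s=1000000000   # 1 Gbit/s
rtt_ms=20
bdp_bytes=$((bw_bits_per_s / 8 * rtt_ms / 1000))
echo "$bdp_bytes"          # 2500000 bytes, ~2.4 MB
```

The default maximum of 3970528 bytes above already covers this path; a 100 ms intercontinental RTT at the same rate would need about 12.5 MB.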

## ARP parameters

### ARP garbage collection

- `gc_stale_time`: how often (seconds) neighbour entries are checked for staleness; a stale entry is re-resolved before data is sent to it. Default: 60.
- `gc_thresh1`: minimum number of entries kept in the ARP cache; the garbage collector does not run if there are fewer. Default: 128.
- `gc_thresh2`: soft maximum of entries in the ARP cache; the collector allows the count to exceed this for 5 seconds before running. Default: 512.
- `gc_thresh3`: hard maximum of entries in the ARP cache; the collector runs immediately once the count exceeds it. Default: 1024.

For example, these can be increased to:

```
net.ipv4.neigh.default.gc_thresh1=1024
net.ipv4.neigh.default.gc_thresh2=4096
net.ipv4.neigh.default.gc_thresh3=8192
```
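Whether the thresholds need raising can be judged from the current neighbour-table occupancy. A sketch:

```sh
# Current number of IPv4 neighbour entries
ip -4 neigh show | wc -l

# Hard limit at which garbage collection runs immediately
sysctl net.ipv4.neigh.default.gc_thresh3
```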

### ARP filtering

The descriptions below are quoted from the kernel's ip-sysctl documentation:

    arp_filter - BOOLEAN

        1 - Allows you to have multiple network interfaces on the same
        subnet, and have the ARPs for each interface be answered
        based on whether or not the kernel would route a packet from
        the ARP'd IP out that interface (therefore you must use source
        based routing for this to work). In other words it allows control
        of which cards (usually 1) will respond to an arp request.

        0 - (default) The kernel can respond to arp requests with addresses
        from other interfaces. This may seem wrong but it usually makes
        sense, because it increases the chance of successful communication.
        IP addresses are owned by the complete host on Linux, not by
        particular interfaces. Only for more complex setups like load-
        balancing, does this behaviour cause problems.

        arp_filter for the interface will be enabled if at least one of
        conf/{all,interface}/arp_filter is set to TRUE,
        it will be disabled otherwise

    arp_announce - INTEGER

        Define different restriction levels for announcing the local
        source IP address from IP packets in ARP requests sent on
        interface:
        0 - (default) Use any local address, configured on any interface
        1 - Try to avoid local addresses that are not in the target's
        subnet for this interface. This mode is useful when target
        hosts reachable via this interface require the source IP
        address in ARP requests to be part of their logical network
        configured on the receiving interface. When we generate the
        request we will check all our subnets that include the
        target IP and will preserve the source address if it is from
        such subnet. If there is no such subnet we select source
        address according to the rules for level 2.
        2 - Always use the best local address for this target.
        In this mode we ignore the source address in the IP packet
        and try to select local address that we prefer for talks with
        the target host. Such local address is selected by looking
        for primary IP addresses on all our subnets on the outgoing
        interface that include the target IP address. If no suitable
        local address is found we select the first local address
        we have on the outgoing interface or on all other interfaces,
        with the hope we will receive reply for our request and
        even sometimes no matter the source IP address we announce.

        The max value from conf/{all,interface}/arp_announce is used.

        Increasing the restriction level gives more chance for
        receiving answer from the resolved target while decreasing
        the level announces more valid sender's information.

    arp_ignore - INTEGER

        Define different modes for sending replies in response to
        received ARP requests that resolve local target IP addresses:
        0 - (default): reply for any local target IP address, configured
        on any interface
        1 - reply only if the target IP address is local address
        configured on the incoming interface
        2 - reply only if the target IP address is local address
        configured on the incoming interface and both with the
        sender's IP address are part from same subnet on this interface
        3 - do not reply for local addresses configured with scope host,
        only resolutions for global and link addresses are replied
        4-7 - reserved
        8 - do not reply for all local addresses

        The max value from conf/{all,interface}/arp_ignore is used
        when ARP request is received on the {interface}

    arp_notify - BOOLEAN

        Define mode for notification of address and device changes.
        0 - (default): do nothing
        1 - Generate gratuitous arp requests when device is brought up
        or hardware address changes.

    arp_accept - BOOLEAN

        Define behavior for gratuitous ARP frames whose IP is not
        already present in the ARP table:
        0 - don't create new entries in the ARP table
        1 - create new entries in the ARP table

        Both replies and requests type gratuitous arp will trigger the
        ARP table to be updated, if this setting is on.

        If the ARP table already contains the IP address of the
        gratuitous arp frame, the arp table will be updated regardless
        if this setting is on or off.
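A classic application of `arp_ignore`/`arp_announce` is direct-routing load balancing (e.g. LVS/DR), where real servers carry the VIP on `lo` but must not answer ARP for it. A sketch of that setup, not from the text above (requires root):

```sh
# Reply only when the target IP is configured on the receiving interface
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.lo.arp_ignore=1

# Always pick the best local source address for outgoing ARP requests
sysctl -w net.ipv4.conf.all.arp_announce=2
```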

## References

- [Linux Kernel ip sysctl documentation](https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt)