📝 docs(1.0): improve the Linux-related articles on XDP, tools, and kernel network parameters
Signed-off-by: Tony Deng <wolf.deng@gmail.com>
Showing 10 changed files with 623 additions and 2 deletions.
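
@@ -0,0 +1,49 @@
# XDP

XDP (eXpress Data Path) provides a high-performance, programmable network data path in the Linux kernel. Because packets are processed before they enter the network protocol stack, XDP brings a large performance gain to Linux networking (claimed to be even higher than that of DPDK in some scenarios).

![xdp-packet-processing](images/xdp-packet-processing-1024x560.png)

The main features of XDP include

- Processing before the network protocol stack
- Lock-free design
- Batched I/O operations
- Polling mode
- Direct queue access
- No skbuff allocation required
- Support for network offload
- DDIO
- XDP programs run quickly to completion, without unbounded loops
- Packet steering
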
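## Comparison with DPDK

Compared with DPDK, XDP has the following advantages

- No third-party code library or license required
- Supports both polling and interrupt-driven networking
- No hugepage allocation required
- No dedicated CPUs required
- No need to define a new network security model
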
## Examples

- [Linux kernel BPF samples](https://github.com/torvalds/linux/tree/master/samples/bpf)
- [prototype-kernel samples](https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/samples/bpf)
- [libbpf](https://github.com/torvalds/linux/tree/master/tools/lib/bpf)

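## Drawbacks

Note that XDP's performance gains come at a cost: it sacrifices generality and fairness.

- XDP provides no buffering queue (qdisc). Packets are dropped outright when the TX device is too slow, so do not use XDP on devices where RX is faster than TX
- XDP programs are special-purpose and lack the generality of the network protocol stack
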
## References

- [Introduction to XDP](https://www.iovisor.org/technology/xdp)
- [Network Performance BoF](http://people.netfilter.org/hawk/presentations/NetDev1.1_2016/links.html)
- [XDP Introduction and Use-cases](http://people.netfilter.org/hawk/presentations/xdp2016/xdp_intro_and_use_cases_sep2016.pdf)
- [Linux Network Stack](http://people.netfilter.org/hawk/presentations/theCamp2016/theCamp2016_next_steps_for_linux.pdf)
- [NetDev 1.2 video](https://www.youtube.com/watch?v=NlMQ0i09HMU&feature=youtu.be&t=3m3s)
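
@@ -0,0 +1,22 @@
# XDP Architecture

XDP builds on a set of techniques to achieve high performance and programmability, including

- Built on eBPF
- Capabilities negotiation: the features supported by the NIC driver are determined through negotiation. XDP uses newer features where they are available, but the NIC driver does not have to support all of them
- Processing before the network protocol stack
- Lock-free design
- Batched I/O operations
- Polling mode
- Direct queue access
- No skbuff allocation required
- Support for network offload
- DDIO
- XDP programs run quickly to completion, without unbounded loops
- Packet steering

## Packet processing logic

As the figure below shows, packets are processed by an in-kernel eBPF program. Each RX queue is assigned one CPU, and skbuff allocation is avoided by using one page per packet (packet-page).

![xdp-packet-processor](images/packet-processor.png)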

@@ -0,0 +1,10 @@
# XDP Use Cases

XDP use cases include

- DDoS mitigation
- Firewalls
- Load balancing based on `XDP_TX`
- Network statistics
- Complex network sampling
- High-speed trading platforms
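
The DDoS, firewall, and `XDP_TX` load-balancing cases all come down to returning a verdict for each packet. The following is a toy Python model of that decision, not a real eBPF program: the action values mirror the kernel's `enum xdp_action`, while `BLOCKED_SOURCES`, `LOAD_BALANCED_VIP`, and `verdict` are hypothetical names introduced only for illustration.

```python
import ipaddress
from enum import IntEnum


class XdpAction(IntEnum):
    """Verdicts an XDP program can return (values as in the kernel's enum xdp_action)."""
    XDP_ABORTED = 0
    XDP_DROP = 1
    XDP_PASS = 2
    XDP_TX = 3
    XDP_REDIRECT = 4


# Hypothetical policy data, for illustration only (documentation address ranges).
BLOCKED_SOURCES = [ipaddress.ip_network("203.0.113.0/24")]
LOAD_BALANCED_VIP = ipaddress.ip_address("192.0.2.10")


def verdict(src_ip: str, dst_ip: str) -> XdpAction:
    """Return the action a simple filter/load-balancer policy would take."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    # DDoS mitigation / firewall: drop traffic from blocked sources first.
    if any(src in net for net in BLOCKED_SOURCES):
        return XdpAction.XDP_DROP
    # Load balancing: a real program would rewrite headers here, then
    # send the packet back out of the interface it arrived on.
    if dst == LOAD_BALANCED_VIP:
        return XdpAction.XDP_TX
    # Everything else continues into the normal network stack.
    return XdpAction.XDP_PASS


if __name__ == "__main__":
    print(verdict("203.0.113.7", "192.0.2.10").name)
    print(verdict("198.51.100.1", "192.0.2.10").name)
    print(verdict("198.51.100.1", "192.0.2.99").name)
```

A real XDP program would be written in restricted C, compiled to eBPF, and would parse the packet headers itself. The sketch only shows how the use cases map onto return codes.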
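
@@ -0,0 +1,201 @@
# Network Parameters in the Kernel

## nf_conntrack

`nf_conntrack` is the Linux kernel's connection tracking module. It is commonly used with `iptables`, for example

```
-A INPUT -m state --state RELATED,ESTABLISHED -j RETURN
-A INPUT -m state --state INVALID -j DROP
```

The connections currently being tracked can be viewed with `cat /proc/net/nf_conntrack`. These entries are kept in memory in a hash table (collisions are resolved by chaining), and each entry takes about 300 bytes.

Three kernel parameters relate to `nf_conntrack`:

- `nf_conntrack_max`: the size of the connection tracking table. It is recommended to compute it from the amount of memory as `CONNTRACK_MAX = RAMSIZE (in bytes) / 16384 / (x / 32)`, where `x` is the pointer width in bits (32 or 64), and to keep `nf_conntrack_max=4*nf_conntrack_buckets`. Default: 262144
- `nf_conntrack_buckets`: the size of the hash table (`nf_conntrack_max/nf_conntrack_buckets` is the average chain length per bucket when the table is full). Default: 65536
- `nf_conntrack_tcp_timeout_established`: the timeout for established TCP sessions. Default: 432000 (5 days)
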
For example, a recommended configuration for a machine with 64 GB of memory:

```
net.netfilter.nf_conntrack_max=4194304
net.netfilter.nf_conntrack_tcp_timeout_established=300
net.netfilter.nf_conntrack_buckets=1048576
```

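The sizing formula and the 4:1 ratio can be checked with a short calculation. The sketch below is only arithmetic on the formula `RAMSIZE / 16384 / (x / 32)` and the roughly 300 bytes per entry quoted earlier; the function names are hypothetical. Note that for 64 GiB the formula gives 2097152 with `x = 64`, while 4194304 corresponds to `x = 32`.

```python
def conntrack_max(ram_bytes: int, pointer_bits: int = 64) -> int:
    """CONNTRACK_MAX = RAMSIZE (in bytes) / 16384 / (x / 32)."""
    return ram_bytes // 16384 // (pointer_bits // 32)


def conntrack_buckets(max_entries: int) -> int:
    """Keep nf_conntrack_max = 4 * nf_conntrack_buckets."""
    return max_entries // 4


def table_memory_bytes(max_entries: int, entry_bytes: int = 300) -> int:
    """Rough memory use of a full table, assuming about 300 bytes per entry."""
    return max_entries * entry_bytes


if __name__ == "__main__":
    ram = 64 * 1024**3  # 64 GiB
    print(conntrack_max(ram, 64))         # 2097152
    print(conntrack_max(ram, 32))         # 4194304
    print(conntrack_buckets(4194304))     # 1048576
    print(table_memory_bytes(4194304))    # 1258291200 (about 1.2 GB)
```

With `nf_conntrack_max=4194304`, a completely full table would therefore hold on the order of 1.2 GB of entries, which is worth weighing against the available memory.
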
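## bridge-nf

bridge-nf lets netfilter filter IPv4/ARP/IPv6 packets on Linux bridges. For example, with `net.bridge.bridge-nf-call-iptables=1`, packets forwarded by a layer 2 bridge are also filtered by the iptables FORWARD rules, which can sometimes cause L3 iptables rules to filter L2 frames (see [here](https://bugzilla.redhat.com/show_bug.cgi?id=512206)).

Commonly used options include

- `net.bridge.bridge-nf-call-arptables`: whether bridged ARP packets are filtered in the `arptables` FORWARD chain
- `net.bridge.bridge-nf-call-ip6tables`: whether IPv6 packets are filtered in the `ip6tables` chains
- `net.bridge.bridge-nf-call-iptables`: whether IPv4 packets are filtered in the `iptables` chains
- `net.bridge.bridge-nf-filter-vlan-tagged`: whether VLAN-tagged packets are filtered in `iptables`/`arptables`

These options can also be set per bridge through `/sys/devices/virtual/net/<bridge-name>/bridge/nf_call_iptables`, but note that the kernel applies the larger of the two values.

Sometimes you may want to disable bridge-nf on only some bridges while keeping it enabled on all others (for example, CNI network plugins generally require `bridge-nf-call-iptables` to be enabled, yet you may want bridge-nf disabled on one particular bridge). In that case, an iptables rule can be used instead:

```sh
iptables -t raw -I PREROUTING -i <bridge-name> -j NOTRACK
```
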
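## Reverse path filtering

Reverse path filtering prevents packets that arrive on one interface from leaving through a different interface (sometimes called "asymmetric routing"). Unless there is a specific need, it is best left enabled, because it prevents users on local subnets from spoofing IP addresses and reduces the opportunity for DDoS (distributed denial of service) attacks.

Reverse path filtering is enabled through the rp_filter option, for example `sysctl -w net.ipv4.conf.default.rp_filter=INTEGER`. Three values are supported:

- 0: no source validation.
- 1: strict mode as defined in RFC3704.
- 2: loose mode as defined in RFC3704.

The setting can be overridden per network interface with `net.ipv4.conf.<interface>.rp_filter`. The kernel uses the larger of the `all` and per-interface values.
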
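According to the kernel's ip-sysctl documentation, the value that takes effect on an interface is the maximum of `conf/all/rp_filter` and `conf/<interface>/rp_filter`. The sketch below works this out from sysctl-style text. `parse_sysctl` and `effective_rp_filter` are hypothetical helpers, and treating a missing key as 0 is a simplifying assumption.

```python
def parse_sysctl(text: str) -> dict:
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def effective_rp_filter(settings: dict, interface: str) -> int:
    """Return max(all, interface); missing keys count as 0 (assumption)."""
    all_value = int(settings.get("net.ipv4.conf.all.rp_filter", 0))
    iface_value = int(settings.get(f"net.ipv4.conf.{interface}.rp_filter", 0))
    return max(all_value, iface_value)


SAMPLE = """
# strict mode globally, loose mode on eth1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.eth0.rp_filter = 0
net.ipv4.conf.eth1.rp_filter = 2
"""

if __name__ == "__main__":
    conf = parse_sysctl(SAMPLE)
    print(effective_rp_filter(conf, "eth0"))  # 1
    print(effective_rp_filter(conf, "eth1"))  # 2
```

One consequence is that setting a single interface to 0 does not disable filtering on it while `all` is still 1.
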
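## TCP parameters

**Parameter**| **Description**| **Default**| **Tuned value**
-------| ------- | -------- | --------
`net.core.rmem_default` | Default socket receive buffer size (bytes). | 212992 |
`net.core.rmem_max` | Maximum socket receive buffer size (bytes). | 212992 |
`net.core.wmem_default` | Default socket send buffer size (bytes). | 212992 |
`net.core.wmem_max` | Maximum socket send buffer size (bytes). | 212992 |
`net.core.netdev_max_backlog` | Maximum number of packets allowed to queue when a network interface receives packets faster than the kernel can process them. | 1000 | 10000
`net.core.somaxconn` | Upper limit on the listen queue (backlog) length of each listening socket. This is a system-wide parameter. | 128 | 2048
`net.core.optmem_max`| Maximum ancillary buffer size allowed per socket. | 20480 | 81920
`net.ipv4.tcp_mem` | Determines how the TCP stack responds to memory usage. Each value is in memory pages (usually 4 KB). The first value is the lower bound on memory usage; the second is the threshold at which memory pressure mode starts applying pressure to buffer usage; the third is the upper limit, at which packets may be dropped to reduce memory use. These values can be increased for a large BDP (bandwidth-delay product); note that the unit is pages, not bytes. | 5814 7754 11628 |
`net.ipv4.tcp_rmem` | Defines the memory used by sockets for auto-tuning. The first value is the minimum number of bytes allocated to a socket's receive buffer; the second is the default (for TCP it overrides `rmem_default`), up to which the buffer can grow when the system is not heavily loaded; the third is the maximum size of an auto-tuned receive buffer (it does not override `rmem_max`). | 4096 87380 3970528 |
`net.ipv4.tcp_wmem` | Defines the memory used by sockets for auto-tuning. The first value is the minimum number of bytes allocated to a socket's send buffer; the second is the default (for TCP it overrides `wmem_default`), up to which the buffer can grow when the system is not heavily loaded; the third is the maximum size of an auto-tuned send buffer (it does not override `wmem_max`). | 4096 16384 3970528 |
`net.ipv4.tcp_keepalive_time` | Idle time (seconds) before TCP starts sending keepalive probes to check whether the connection is still alive. | 7200 | 1800
`net.ipv4.tcp_keepalive_intvl` | Interval (seconds) between retransmissions of a probe that received no response. | 75 | 30
`net.ipv4.tcp_keepalive_probes` | Maximum number of keepalive probes sent before the TCP connection is considered dead. | 9 | 3
`net.ipv4.tcp_sack` | Enables selective acknowledgment (1 means enabled). It improves performance by selectively acknowledging out-of-order segments so that the sender retransmits only the missing ones. It should be enabled (for WAN traffic), at the cost of higher CPU usage. | 1 | 1
`net.ipv4.tcp_fack` | Enables forward acknowledgment, which works together with selective acknowledgment (SACK) to reduce congestion. This option should also be enabled (it is a legacy option with no effect on Linux 4.15 and later). | 1 | 1
`net.ipv4.tcp_timestamps` | TCP timestamps (adds 12 bytes to the TCP header). They allow RTT to be calculated more accurately than with retransmission timeouts (see RFC 1323). Enable this option for better performance. | 1 | 1
`net.ipv4.tcp_window_scaling` | Enables window scaling as defined in RFC 1323. It must be enabled (1 means enabled) to support TCP windows larger than 64 KB. The TCP window can be up to 1 GB, and scaling only takes effect when both ends of the connection enable it. | 1 | 1
`net.ipv4.tcp_syncookies` | Whether TCP SYN cookies are enabled. The kernel must be built with `CONFIG_SYN_COOKIES`. SYN cookies keep a socket from being overloaded when too many connection attempts arrive. | 1 | 1
`net.ipv4.tcp_tw_reuse` | Whether sockets in the TIME-WAIT state (TIME-WAIT ports) may be reused for new TCP connections. | 0 | 1
`net.ipv4.tcp_tw_recycle` | Enables faster recycling of TIME-WAIT sockets. It breaks connections from clients behind NAT and was removed in Linux 4.12, so it is best left disabled. | 0 | 0
`net.ipv4.tcp_fin_timeout` | For connections closed by the local end, how long (seconds) TCP stays in the FIN-WAIT-2 state. The peer may close the connection, never finish closing it, or its process may die unexpectedly. | 60 | 30
`net.ipv4.ip_local_port_range` | The range of local port numbers that TCP/UDP may use. | 32768 60999 | 1024 65000
`net.ipv4.tcp_max_syn_backlog` | Maximum number of connection requests that have not yet been acknowledged by the peer and can be kept in the queue. If the server is frequently overloaded, try increasing this number. | 128 |
`net.ipv4.tcp_low_latency` | Lets the TCP/IP stack favor low latency over high throughput. This option should be disabled. | 0 | 0
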
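Two of the rows above are easier to judge with numbers. The sketch below estimates how long a dead peer goes unnoticed under the keepalive settings (roughly `tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes`), converts `tcp_mem` pages to bytes, and computes a bandwidth-delay product. The function names are hypothetical and the 4096-byte page size is an assumption.

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Approximate time until an unresponsive peer is declared dead."""
    return time_s + intvl_s * probes


def pages_to_bytes(pages: int, page_size: int = 4096) -> int:
    """tcp_mem is expressed in pages; 4096 bytes per page is assumed."""
    return pages * page_size


def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes for a given link speed and RTT."""
    return bandwidth_bits_per_s * rtt_ms // 8000


if __name__ == "__main__":
    print(keepalive_detection_seconds(7200, 75, 9))  # 7875 (defaults)
    print(keepalive_detection_seconds(1800, 30, 3))  # 1890 (tuned values)
    print(pages_to_bytes(11628))                     # 47628288
    print(bdp_bytes(1_000_000_000, 100))             # 12500000
```

With the defaults, a dead peer is detected after a little over two hours; the tuned values bring that down to about half an hour. A 1 Gbit/s path with 100 ms RTT needs about 12.5 MB in flight, well above the third value of the default `tcp_rmem`.
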
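## ARP parameters

### ARP garbage collection

- `gc_stale_time`: how often neighbour entries are checked for staleness. When a neighbour entry becomes stale, it is resolved again before data is sent to it. Default: 60 seconds.
- `gc_thresh1`: the minimum number of entries kept in the ARP cache. The garbage collector does not run if there are fewer entries than this. Default: 128.
- `gc_thresh2`: the soft maximum number of entries in the ARP cache. The garbage collector allows the number of entries to exceed this for 5 seconds before it starts collecting. Default: 512.
- `gc_thresh3`: the hard maximum number of entries in the ARP cache. The garbage collector runs immediately once the cache holds more entries than this. Default: 1024.
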
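The three thresholds can be read as a small decision procedure. The sketch below is a simplified model of the descriptions above, not the kernel's implementation; `gc_action` and its return labels are hypothetical.

```python
def gc_action(entries: int, seconds_over_soft_limit: float = 0.0,
              thresh1: int = 128, thresh2: int = 512, thresh3: int = 1024) -> str:
    """Classify what the neighbour garbage collector would do (simplified)."""
    if entries < thresh1:
        return "idle"          # below the minimum: the collector does not run
    if entries > thresh3:
        return "collect_now"   # hard limit exceeded: collect immediately
    if entries > thresh2 and seconds_over_soft_limit > 5:
        return "collect"       # soft limit exceeded for more than 5 seconds
    return "periodic"          # only regular periodic stale-entry checks apply


if __name__ == "__main__":
    print(gc_action(100))        # idle
    print(gc_action(600, 2))     # periodic
    print(gc_action(600, 6))     # collect
    print(gc_action(2000))       # collect_now
```

On hosts with many neighbours, for example nodes with a large number of containers on one subnet, entry counts above `gc_thresh3` are what make raising the thresholds worthwhile.
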
For example, the thresholds can be raised to

```
net.ipv4.neigh.default.gc_thresh1=1024
net.ipv4.neigh.default.gc_thresh2=4096
net.ipv4.neigh.default.gc_thresh3=8192
```

### ARP filtering

`arp_filter` - BOOLEAN

- `1`: Allows you to have multiple network interfaces on the same subnet, and have the ARPs for each interface be answered based on whether or not the kernel would route a packet from the ARP'd IP out that interface (therefore you must use source based routing for this to work). In other words it allows control of which cards (usually 1) will respond to an arp request.
- `0` (default): The kernel can respond to arp requests with addresses from other interfaces. This may seem wrong but it usually makes sense, because it increases the chance of successful communication. IP addresses are owned by the complete host on Linux, not by particular interfaces. Only for more complex setups like load-balancing, does this behaviour cause problems.

arp_filter for the interface will be enabled if at least one of conf/{all,interface}/arp_filter is set to TRUE, it will be disabled otherwise.

`arp_announce` - INTEGER

Define different restriction levels for announcing the local source IP address from IP packets in ARP requests sent on interface:

- `0` (default): Use any local address, configured on any interface.
- `1`: Try to avoid local addresses that are not in the target's subnet for this interface. This mode is useful when target hosts reachable via this interface require the source IP address in ARP requests to be part of their logical network configured on the receiving interface. When we generate the request we will check all our subnets that include the target IP and will preserve the source address if it is from such subnet. If there is no such subnet we select source address according to the rules for level 2.
- `2`: Always use the best local address for this target. In this mode we ignore the source address in the IP packet and try to select local address that we prefer for talks with the target host. Such local address is selected by looking for primary IP addresses on all our subnets on the outgoing interface that include the target IP address. If no suitable local address is found we select the first local address we have on the outgoing interface or on all other interfaces, with the hope we will receive reply for our request and even sometimes no matter the source IP address we announce.

The max value from conf/{all,interface}/arp_announce is used.

Increasing the restriction level gives more chance for receiving answer from the resolved target while decreasing the level announces more valid sender's information.

`arp_ignore` - INTEGER

Define different modes for sending replies in response to received ARP requests that resolve local target IP addresses:

- `0` (default): reply for any local target IP address, configured on any interface
- `1`: reply only if the target IP address is local address configured on the incoming interface
- `2`: reply only if the target IP address is local address configured on the incoming interface and both with the sender's IP address are part from same subnet on this interface
- `3`: do not reply for local addresses configured with scope host, only resolutions for global and link addresses are replied
- `4-7`: reserved
- `8`: do not reply for all local addresses

The max value from conf/{all,interface}/arp_ignore is used when ARP request is received on the {interface}.

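The `arp_ignore` modes can be illustrated with a small decision function. This is a sketch of the rules as worded above, not kernel code: `should_reply` and the `local_addrs` mapping are hypothetical, and mode 3 is left out because it depends on address scope, which this model does not track.

```python
import ipaddress


def should_reply(mode: int, target_ip: str, sender_ip: str,
                 incoming_iface: str, local_addrs: dict) -> bool:
    """Decide whether to answer an ARP request under a given arp_ignore mode.

    local_addrs maps an interface name to a list of 'address/prefix' strings.
    """
    target = ipaddress.ip_address(target_ip)
    sender = ipaddress.ip_address(sender_ip)
    ifaces = {name: [ipaddress.ip_interface(a) for a in addrs]
              for name, addrs in local_addrs.items()}
    # Addresses on the incoming interface that match the requested target.
    on_incoming = [a for a in ifaces.get(incoming_iface, []) if a.ip == target]
    on_any = any(a.ip == target for addrs in ifaces.values() for a in addrs)

    if mode == 0:
        return on_any
    if mode == 1:
        return bool(on_incoming)
    if mode == 2:
        # Target is on the incoming interface and the sender shares its subnet.
        return any(sender in a.network for a in on_incoming)
    if mode == 8:
        return False
    raise ValueError(f"mode {mode} is not modelled in this sketch")


if __name__ == "__main__":
    local = {"eth0": ["10.0.0.1/24"], "eth1": ["10.0.1.1/24"]}
    # A request for eth1's address arriving on eth0:
    print(should_reply(0, "10.0.1.1", "10.0.0.5", "eth0", local))  # True
    print(should_reply(1, "10.0.1.1", "10.0.0.5", "eth0", local))  # False
```

The example shows why mode 1 is often chosen on multi-homed hosts: with the default mode 0, eth0 answers for an address that is only configured on eth1.
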
`arp_notify` - BOOLEAN

Define mode for notification of address and device changes.

- `0` (default): do nothing
- `1`: Generate gratuitous arp requests when device is brought up or hardware address changes.

`arp_accept` - BOOLEAN

Define behavior for gratuitous ARP frames whose IP is not already present in the ARP table:

- `0`: don't create new entries in the ARP table
- `1`: create new entries in the ARP table

Both replies and requests type gratuitous arp will trigger the ARP table to be updated, if this setting is on.

If the ARP table already contains the IP address of the gratuitous arp frame, the arp table will be updated regardless if this setting is on or off.

## Reference documentation

- [Linux Kernel ip sysctl documentation](https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt)