📝 docs(1.0): Improve the Linux articles on XDP, tools, and kernel network parameters

Signed-off-by: Tony Deng <wolf.deng@gmail.com>
tonydeng · Nov 30, 2017 · 296bee2
1 parent 6c92215
commit 296bee2
Showing 10 changed files with 623 additions and 2 deletions.
diff --git a/SUMMARY.md b/SUMMARY.md
@@ -27,13 +27,13 @@
   - [eBPF](linux/bpf/README.md)
     - [bcc](linux/bpf/bcc.md)
     - [Troubleshooting](linux/bpf/troubleshooting.md)
-  - [XDP](linux/XDP/index.md)
+  - [XDP](linux/XDP/README.md)
     - [XDP Architecture](linux/XDP/design.md)
     - [Use Cases](linux/XDP/use-cases.md)
   - [Common Tools](linux/tools.md)
     - [Packet Capture with tcpdump](linux/tcpdump.md)
     - [scapy](linux/scapy.md)
-  - [Kernel Network Parameters](linux/params.md)
+  - [Kernel Network Parameters](linux/kernel-network-params.md)
 - [4. Open vSwitch](ovs/README.md)
   - [Introduction to OVS](ovs/README.md)
   - [Building OVS](ovs/build.md)

diff --git a/linux/XDP/README.md b/linux/XDP/README.md
@@ -0,0 +1,49 @@
+# XDP
+
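+XDP (eXpress Data Path) provides a high-performance, programmable network data path in the Linux kernel. Because packets are processed before they enter the network stack, XDP brings a large performance improvement to Linux networking (reportedly even higher than DPDK in some scenarios).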
+
+![xdp-packet-processing](images/xdp-packet-processing-1024x560.png)
+
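+The main features of XDP include:
+
+- Processing before the network stack
+- Lock-free design
+- Batched I/O operations
+- Polling mode
+- Direct queue access
+- No skbuff allocation required
+- Support for network offload
+- DDIO
+- XDP programs run quickly to completion, with no loops
+- Packet steering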
+
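+## Comparison with DPDK
+
+Compared with DPDK, XDP has the following advantages:
+
+- No third-party code libraries or licenses required
+- Supports both polling and interrupt-driven networking
+- No huge pages need to be allocated
+- No dedicated CPUs required
+- No need to define a new network security model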
+
+## Examples
+
+- [Linux kernel BPF samples](https://github.com/torvalds/linux/tree/master/samples/bpf)
+- [prototype-kernel samples](https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/samples/bpf)
+- [libbpf](https://github.com/torvalds/linux/tree/master/tools/lib/bpf)
+
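+## Drawbacks
+
+Note that XDP's performance gains come at a cost: it sacrifices generality and fairness.
+
+- XDP provides no buffering queue (qdisc); packets are dropped outright when the TX device is too slow, so do not use XDP on devices where RX is faster than TX
+- XDP programs are special-purpose and lack the generality of the network stack
+
+## References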
+
+- [Introduction to XDP](https://www.iovisor.org/technology/xdp)
+- [Network Performance BoF](http://people.netfilter.org/hawk/presentations/NetDev1.1_2016/links.html)
+- [XDP Introduction and Use-cases](http://people.netfilter.org/hawk/presentations/xdp2016/xdp_intro_and_use_cases_sep2016.pdf)
+- [Linux Network Stack](http://people.netfilter.org/hawk/presentations/theCamp2016/theCamp2016_next_steps_for_linux.pdf)
+- [NetDev 1.2 video](https://www.youtube.com/watch?v=NlMQ0i09HMU&feature=youtu.be&t=3m3s)
diff --git a/linux/XDP/design.md b/linux/XDP/design.md
@@ -0,0 +1,22 @@
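+# XDP Architecture
+
+XDP builds on a set of techniques to achieve high performance and programmability, including:
+
+- Built on eBPF
+- Capabilities negotiation: the features supported by the NIC driver are determined through negotiation. XDP uses newer features where they are available, but a NIC driver does not need to support all of them
+- Processing before the network stack
+- Lock-free design
+- Batched I/O operations
+- Polling mode
+- Direct queue access
+- No skbuff allocation required
+- Support for network offload
+- DDIO
+- XDP programs run quickly to completion, with no loops
+- Packet steering
+
+## Packet Processing Logic
+
+As the figure below shows, packets are processed by an in-kernel eBPF program. Each RX queue is assigned one CPU, and each packet gets its own page (packet-page), which avoids allocating an skbuff.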
+
+![xdp-packet-processor](images/packet-processor.png)
diff --git a/linux/XDP/images/packet-processor.png b/linux/XDP/images/packet-processor.png
diff --git a/linux/XDP/images/xdp-packet-processing-1024x560.png b/linux/XDP/images/xdp-packet-processing-1024x560.png
diff --git a/linux/XDP/use-cases.md b/linux/XDP/use-cases.md
@@ -0,0 +1,10 @@
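+# XDP Use Cases
+
+Use cases for XDP include:
+
+- DDoS mitigation
+- Firewalling
+- Load balancing based on `XDP_TX`
+- Network statistics
+- Complex network sampling
+- High-speed trading platforms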
diff --git a/linux/kernel-network-params.md b/linux/kernel-network-params.md
@@ -0,0 +1,201 @@
+# Kernel Network Parameters
+
+## nf_conntrack
+
+`nf_conntrack` is the Linux kernel's connection tracking module. It is commonly used with `iptables`, for example:
+
+```
+-A INPUT -m state --state RELATED,ESTABLISHED  -j RETURN
+-A INPUT -m state --state INVALID -j DROP
+```
+
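+Run `cat /proc/net/nf_conntrack` to view the connections currently being tracked. The entries are kept in memory in a hash table (collisions are resolved by separate chaining), and each entry takes roughly 300 bytes.
+
+Three kernel parameters relate to `nf_conntrack`:
+
+- `nf_conntrack_max`: the size of the connection tracking table. It is recommended to derive it from the memory size as `CONNTRACK_MAX = RAMSIZE (in bytes) / 16384 / (x / 32)`, where `x` is the pointer width in bits (32 or 64), and to keep `nf_conntrack_max = 4 * nf_conntrack_buckets`. The default is 262144
+- `nf_conntrack_buckets`: the size of the hash table (`nf_conntrack_max / nf_conntrack_buckets` is the average chain length per bucket when the table is full). The default is 65536
+- `nf_conntrack_tcp_timeout_established`: the timeout for established TCP sessions, in seconds. The default is 432000 (5 days)
+
+For example, on a machine with 64 GB of memory, the recommended settings are: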
+
+```
+net.netfilter.nf_conntrack_max=4194304
+net.netfilter.nf_conntrack_tcp_timeout_established=300
+net.netfilter.nf_conntrack_buckets=1048576
+```
+
+## bridge-nf
+
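+bridge-nf lets netfilter filter IPv4/ARP/IPv6 packets on a Linux bridge. For example, with `net.bridge.bridge-nf-call-iptables=1`, packets forwarded by a layer-2 bridge are also filtered by the iptables FORWARD rules, which sometimes causes L3 iptables rules to filter L2 frames (see [this bug report](https://bugzilla.redhat.com/show_bug.cgi?id=512206)).
+
+Commonly used options include:
+
+- `net.bridge.bridge-nf-call-arptables`: whether bridged ARP packets are filtered in the `arptables` FORWARD chain
+- `net.bridge.bridge-nf-call-ip6tables`: whether bridged IPv6 packets are filtered in the `ip6tables` chains
+- `net.bridge.bridge-nf-call-iptables`: whether bridged IPv4 packets are filtered in the `iptables` chains
+- `net.bridge.bridge-nf-filter-vlan-tagged`: whether VLAN-tagged packets are filtered in `iptables`/`arptables`
+
+These options can also be set per bridge through `/sys/devices/virtual/net/<bridge-name>/bridge/nf_call_iptables`, but note that the kernel uses the larger of the two values.
+
+Sometimes you may want to disable bridge-nf on only some bridges while leaving it enabled on the others (for example, CNI network plugins generally require `bridge-nf-call-iptables` to be enabled, yet you may still want to disable bridge-nf for one particular bridge). In that case, use iptables instead: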
+
+```sh
+iptables -t raw -I PREROUTING -i <bridge-name> -j NOTRACK
+```
+
+## Reverse Path Filtering
+
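+Reverse path filtering prevents packets that arrive on one interface from leaving through a different interface (this is sometimes called "asymmetric routing"). Unless you know that asymmetric routing is required, it is best to leave filtering enabled, because it prevents users on local subnets from spoofing IP addresses and reduces the opportunity for DDoS (distributed denial-of-service) attacks.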
+
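+Reverse path filtering is enabled through the `rp_filter` option, for example `sysctl -w net.ipv4.conf.default.rp_filter=INTEGER`. Three values are supported:
+
+- 0: no source validation.
+- 1: strict mode as defined in RFC 3704.
+- 2: loose mode as defined in RFC 3704.
+
+The setting can be overridden for each network interface through `net.ipv4.conf.<interface>.rp_filter`.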
+
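+## TCP
+
+**Parameter** | **Description** | **Default** | **Tuned value**
+------- | ------- | -------- | --------
+`net.core.rmem_default` | Default socket receive buffer size (bytes). | 212992 |
+`net.core.rmem_max` | Maximum socket receive buffer size (bytes). | 212992 |
+`net.core.wmem_default` | Default socket send buffer size (bytes). | 212992 |
+`net.core.wmem_max` | Maximum socket send buffer size (bytes). | 212992 |
+`net.core.netdev_max_backlog` | Maximum number of packets allowed to queue on each network interface when packets arrive faster than the kernel can process them. | 1000 | 10000
+`net.core.somaxconn` | Maximum length of the listen queue for each listening port. This is a system-wide setting. | 128 | 2048
+`net.core.optmem_max` | Maximum ancillary (option) buffer size allowed per socket (bytes). | 20480 | 81920
+`net.ipv4.tcp_mem` | Determines how the TCP stack responds to memory usage. Each value is in memory pages (usually 4 KB). The first value is the lower bound of memory usage; the second is the threshold at which memory-pressure mode starts applying pressure to buffer usage; the third is the upper bound of memory usage, at which point packets may be dropped to reduce memory use. Increase these values for a large BDP (bandwidth-delay product), keeping in mind that the unit is pages, not bytes. | 5814 7754 11628 |
+`net.ipv4.tcp_rmem` | Defines the memory a socket uses under auto-tuning. The first value is the minimum number of bytes allocated for the socket receive buffer; the second is the default (for TCP sockets it overrides `rmem_default`), and the buffer can grow to this value when the system is not heavily loaded; the third is the maximum number of bytes for the receive buffer (it does not override `rmem_max`). | 4096 87380 3970528 |
+`net.ipv4.tcp_wmem` | Defines the memory a socket uses under auto-tuning. The first value is the minimum number of bytes allocated for the socket send buffer; the second is the default (for TCP sockets it overrides `wmem_default`), and the buffer can grow to this value when the system is not heavily loaded; the third is the maximum number of bytes for the send buffer (it does not override `wmem_max`). | 4096 16384 3970528 |
+`net.ipv4.tcp_keepalive_time` | Idle time (seconds) before TCP starts sending keepalive probes to check whether the connection is still alive. | 7200 | 1800
+`net.ipv4.tcp_keepalive_intvl` | Interval (seconds) between retransmissions of a keepalive probe that received no response. | 75 | 30
+`net.ipv4.tcp_keepalive_probes` | Maximum number of keepalive probes sent before the TCP connection is declared dead. | 9 | 3
+`net.ipv4.tcp_sack` | Enables selective acknowledgment (1 means enabled). Selectively acknowledging segments received out of order improves performance because the sender retransmits only the missing segments. Enable it for WAN traffic, at the cost of higher CPU usage. | 1 | 1
+`net.ipv4.tcp_fack` | Enables forward acknowledgment (FACK), which works with selective acknowledgment (SACK) to reduce congestion. This option should also be enabled. | 1 | 1
+`net.ipv4.tcp_timestamps` | TCP timestamps (adds 12 bytes to the TCP header). Enables RTT calculation that is more accurate than relying on retransmission timeouts (see RFC 1323). Enable it for better performance. | 1 | 1
+`net.ipv4.tcp_window_scaling` | Enables window scaling as defined in RFC 1323. It must be enabled (1 means enabled) to support TCP windows larger than 64 KB; the window can be at most 1 GB, and scaling takes effect only when both ends of the connection enable it. | 1 | 1
+`net.ipv4.tcp_syncookies` | Whether TCP SYN cookies are enabled. The kernel must be built with `CONFIG_SYN_COOKIES`. SYN cookies keep a socket from being overloaded when too many connection attempts arrive. | 1 | 1
+`net.ipv4.tcp_tw_reuse` | Whether sockets (ports) in the TIME-WAIT state may be reused for new TCP connections. | 0 | 1
+`net.ipv4.tcp_tw_recycle` | Recycles TIME-WAIT sockets more quickly. Note that it is known to break connections from clients behind NAT and was removed in Linux 4.12. | 0 | 1
+`net.ipv4.tcp_fin_timeout` | For connections closed by the local end, how long (seconds) TCP stays in the FIN-WAIT-2 state. The peer may have disconnected, may never close its end, or its process may have died unexpectedly. | 60 | 30
+`net.ipv4.ip_local_port_range` | Range of local port numbers that TCP/UDP may use. | 32768 60999 | 1024 65000
+`net.ipv4.tcp_max_syn_backlog` | Maximum number of queued connection requests that have not yet been acknowledged by the peer. If the server is frequently overloaded, try increasing this number. | 128 |
+`net.ipv4.tcp_low_latency` | Lets the TCP/IP stack favor low latency over high throughput. This option should be disabled. | 0 | 0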
+
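+## ARP
+
+### ARP Garbage Collection
+
+- `gc_stale_time`: how often the validity of neighbour entries is checked. When a neighbour entry is stale, it is resolved again before data is sent to it. The default is 60 seconds.
+- `gc_thresh1`: the minimum number of entries kept in the ARP cache. The garbage collector does not run if there are fewer entries than this. The default is 128.
+- `gc_thresh2`: the soft limit on the number of entries in the ARP cache. The garbage collector allows the number of entries to exceed this for 5 seconds before it starts collecting. The default is 512.
+- `gc_thresh3`: the hard limit on the number of entries in the ARP cache. Once the cache holds more entries than this, the garbage collector runs immediately. The default is 1024.
+
+For example, they can be increased to: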
+
+```
+net.ipv4.neigh.default.gc_thresh1=1024
+net.ipv4.neigh.default.gc_thresh2=4096
+net.ipv4.neigh.default.gc_thresh3=8192
+```
+
+### ARP Filtering
+
+arp_filter - BOOLEAN
+
+    1 - Allows you to have multiple network interfaces on the same
+    subnet, and have the ARPs for each interface be answered
+    based on whether or not the kernel would route a packet from
+    the ARP'd IP out that interface (therefore you must use source
+    based routing for this to work). In other words it allows control
+    of which cards (usually 1) will respond to an arp request.
+
+    0 - (default) The kernel can respond to arp requests with addresses
+    from other interfaces. This may seem wrong but it usually makes
+    sense, because it increases the chance of successful communication.
+    IP addresses are owned by the complete host on Linux, not by
+    particular interfaces. Only for more complex setups like load-
+    balancing, does this behaviour cause problems.
+
+    arp_filter for the interface will be enabled if at least one of
+    conf/{all,interface}/arp_filter is set to TRUE,
+    it will be disabled otherwise
+
+arp_announce - INTEGER
+
+    Define different restriction levels for announcing the local
+    source IP address from IP packets in ARP requests sent on
+    interface:
+    0 - (default) Use any local address, configured on any interface
+    1 - Try to avoid local addresses that are not in the target's
+    subnet for this interface. This mode is useful when target
+    hosts reachable via this interface require the source IP
+    address in ARP requests to be part of their logical network
+    configured on the receiving interface. When we generate the
+    request we will check all our subnets that include the
+    target IP and will preserve the source address if it is from
+    such subnet. If there is no such subnet we select source
+    address according to the rules for level 2.
+    2 - Always use the best local address for this target.
+    In this mode we ignore the source address in the IP packet
+    and try to select local address that we prefer for talks with
+    the target host. Such local address is selected by looking
+    for primary IP addresses on all our subnets on the outgoing
+    interface that include the target IP address. If no suitable
+    local address is found we select the first local address
+    we have on the outgoing interface or on all other interfaces,
+    with the hope we will receive reply for our request and
+    even sometimes no matter the source IP address we announce.
+
+    The max value from conf/{all,interface}/arp_announce is used.
+
+    Increasing the restriction level gives more chance for
+    receiving answer from the resolved target while decreasing
+    the level announces more valid sender's information.
+
+arp_ignore - INTEGER
+
+    Define different modes for sending replies in response to
+    received ARP requests that resolve local target IP addresses:
+    0 - (default): reply for any local target IP address, configured
+    on any interface
+    1 - reply only if the target IP address is local address
+    configured on the incoming interface
+    2 - reply only if the target IP address is local address
+    configured on the incoming interface and both with the
+    sender's IP address are part from same subnet on this interface
+    3 - do not reply for local addresses configured with scope host,
+    only resolutions for global and link addresses are replied
+    4-7 - reserved
+    8 - do not reply for all local addresses
+
+    The max value from conf/{all,interface}/arp_ignore is used
+    when ARP request is received on the {interface}
+
+arp_notify - BOOLEAN
+
+    Define mode for notification of address and device changes.
+    0 - (default): do nothing
+    1 - Generate gratuitous arp requests when device is brought up
+        or hardware address changes.
+
+arp_accept - BOOLEAN
+
+    Define behavior for gratuitous ARP frames whose IP is not
+    already present in the ARP table:
+    0 - don't create new entries in the ARP table
+    1 - create new entries in the ARP table
+
+    Both replies and requests type gratuitous arp will trigger the
+    ARP table to be updated, if this setting is on.
+
+    If the ARP table already contains the IP address of the
+    gratuitous arp frame, the arp table will be updated regardless
+    if this setting is on or off.
+
+## References
+
+- [Linux Kernel ip sysctl documentation](https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt)