docs: add underlay and overlay network coexist
cyclinder committed May 15, 2024
1 parent 2990d1f commit e19a176
Showing 5 changed files with 205 additions and 1 deletion.
Binary file added docs/images/underlay_overlay_cni.png
1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -100,6 +100,7 @@ nav:
- Multi-Cluster Networking: usage/submariner.md
- Access Service for Underlay CNI: usage/underlay_cni_service.md
- Bandwidth Manage for IPVlan CNI: usage/ipvlan_bandwidth.md
- Coexistence of multi CNIs: usage/multi_cni_coexist.md
- Kubevirt: usage/kubevirt.md
- FAQ: usage/faq.md
- Reference:
2 changes: 1 addition & 1 deletion docs/usage/install/overlay/get-started-calico-zh_cn.md
@@ -32,7 +32,7 @@
```shell
~# helm repo add spiderpool https://spidernet-io.github.io/spiderpool
~# helm repo update spiderpool
~# helm install spiderpool spiderpool/spiderpool --namespace kube-system --set coordinator.mode=overlay --wait
~# helm install spiderpool spiderpool/spiderpool --namespace kube-system --wait
```

> If the Macvlan CNI is not installed in your cluster, you can pass the Helm parameter `--set plugins.installCNI=true` to install Macvlan and other CNIs on every node.
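For example, the install command above could include this flag as follows (a sketch based on the command shown above; adjust the release name and namespace to your environment):

```shell
~# helm install spiderpool spiderpool/spiderpool --namespace kube-system \
     --set plugins.installCNI=true --wait
```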
101 changes: 101 additions & 0 deletions docs/usage/multi_cni_coexist-zh_CN.md
@@ -0,0 +1,101 @@
# Coexistence of Multiple CNIs in a Cluster

## Background

A CNI is one of the key components of a Kubernetes cluster. Usually a single CNI (such as Calico) is deployed and takes care of cluster network connectivity. In some cases, however, users run multiple CNIs in one cluster for performance or security reasons, for example an Underlay CNI such as Macvlan. A cluster may then contain Pods backed by several CNI types, each suited to a different scenario:

* Pods with a single Calico NIC: system components such as CoreDNS, which need neither a fixed IP nor north-south connectivity, only east-west communication within the cluster.
* Pods with a single Macvlan NIC: applications with special performance or security requirements, or traditional applications moving to the cloud that need to expose the Pod IP directly for north-south traffic.
* Multi-NIC Pods with both Calico and Macvlan NICs: applications that need both of the above, i.e. a fixed Pod IP for north-south traffic as well as east-west communication with the cluster (for example with Calico Pods or Services).

When these three kinds of Pods coexist in one cluster, the cluster effectively runs two different data-forwarding schemes, Underlay and Overlay, which can cause additional problems:

* Pods on the Underlay network cannot communicate directly with Pods on the Overlay network: the forwarding paths differ. Overlay traffic usually needs a second hop through the node, whereas Underlay traffic is generally forwarded directly by the underlying gateway, so when the two access each other, packets may be dropped because the underlying switch has not learned routes for the cluster subnet.
* Running two network modes in one cluster may increase the complexity of use and operation, for example IP address management.

Spiderpool, as a complete Underlay networking solution, solves the interconnectivity problem when multiple CNIs coexist in a cluster and reduces the burden of IP address management. The following sections describe the data-forwarding paths between these Pods.

## Quick start

* For Calico + Macvlan multi-NIC Pods, see [get-started-calico](./install/overlay/get-started-calico-zh_cn.md)
* For a single Macvlan NIC, see [get-started-macvlan](./install/underlay/get-started-macvlan-zh_CN.md)
* For Underlay CNI access to Services, see [underlay_cni_service](./underlay_cni_service-zh_CN.md)

## Data forwarding paths

![dataplane](../images/underlay_overlay_cni.png)

Several typical communication scenarios are described below:

### Calico Pod accessing a Calico + Macvlan multi-NIC Pod

As shown by the `calico access to calico-macvlan pod` path in the figure above:

1. The packet is first forwarded from the Calico Pod (10.233.100.1) to the node through its calixxx virtual NIC, then routed between nodes to the target host.
2. Whether the packet targets the Pod's Calico NIC (10.233.100.2) or its Macvlan NIC IP (10.7.200.1), it is forwarded into the Pod through the Pod's corresponding calixxx virtual NIC.

Because of the limitation of Macvlan bridge mode, the master (parent) interface and its sub-interfaces cannot communicate directly, so the node cannot reach the Pod's Macvlan IP directly. Spiderpool therefore injects a route on the node that forwards traffic for the Pod's Macvlan NIC through the calixxx interface, so that the Macvlan parent and sub-interfaces can communicate.

3. When the calico-macvlan Pod sends its reply, the destination Pod IP is 10.233.100.1. Since 10.233.64.0/18 is the Calico subnet, all traffic destined for the Calico subnet is forwarded from eth0 to the host:

~# kubectl exec -it calico-macvlan-556dddfdb-4ctxv -- ip r
10.233.64.0/18 via 10.7.168.71 dev eth0 src 10.233.100.2

4. Because the destination IP of the reply is 10.233.100.1, it matches the Calico subnet's tunnel route and is forwarded to the target node, then delivered to the target Pod through its calixxx virtual NIC, completing the exchange.
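As a quick sanity check on the node hosting the multi-NIC Pod, you can confirm that the Macvlan IP is routed through the Pod's calixxx interface (a sketch; the actual interface name and output format depend on your environment):

```shell
# The Pod's Macvlan IP (10.7.200.1 in the figure) should be reachable via the
# calixxx veth route injected by Spiderpool, not via the physical network.
~# ip route | grep 10.7.200.1
10.7.200.1 dev calixxx scope link
```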

### Calico + Macvlan multi-NIC Pod accessing a Calico Pod's Service

As shown by the `access to calico pod's service` path in the figure above:

1. Spiderpool injects a route into the Pod so that packets destined for Services are forwarded from eth0. The packet therefore first matches the ServiceCIDR route inside the Pod and is forwarded from eth0 to the node. Below, 10.233.0.0/18 is the Service subnet:

~# kubectl exec -it calico-macvlan-556dddfdb-4ctxv -- ip r
10.233.0.0/18 via 10.7.168.71 dev eth0 src 10.233.100.2

2. kube-proxy in the host network stack resolves the target Service address to the Calico Pod's IP, 10.233.100.1. The packet is then forwarded to the target node via inter-node routing, and finally into the target Pod through its calixxx virtual NIC.
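A simple end-to-end check of this path (a sketch; it assumes the Pod image provides `nslookup` and that cluster DNS is reachable through a ClusterIP Service) is to resolve a Service name from inside the multi-NIC Pod:

```shell
# Name resolution goes to the cluster DNS ClusterIP, i.e. out of eth0 and
# through kube-proxy on the node, exercising the path described above.
~# kubectl exec -it calico-macvlan-556dddfdb-4ctxv -- nslookup kubernetes.default.svc.cluster.local
```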

### Macvlan Pod accessing a Calico Pod

As shown by the `macvlan access to calico pod` path in the figure above:

1. Spiderpool injects a route into the Pod that forwards traffic for the Calico subnet through veth0. Below, 10.233.64.0/18 is the Calico subnet; this route ensures that traffic from the Pod to the Calico subnet is forwarded to the host via veth0:

~# kubectl exec -it macvlan-76c49c7bfb-82fnt -- ip r
10.233.64.0/18 via 10.7.168.71 dev veth0 src 10.7.200.2

2. Once on the host, because the destination address is 10.233.100.1, the packet is forwarded to the target host via Calico's tunnel route, and then into the target Pod through the target Pod's calixxx virtual NIC.
3. When the target Calico Pod sends its reply, however, the destination IP is 10.7.200.2, so the packet is forwarded directly to the target Pod without passing through the node. Because the forward and return paths differ, the kernel may mark the packet's conntrack state as invalid, and the packet is then dropped by one of kube-proxy's iptables rules:

~# iptables-save -t filter | grep -- '--ctstate INVALID -j DROP'
-A FORWARD -m conntrack --ctstate INVALID -j DROP

This rule was originally added to address [#Issue 74839](https://github.com/kubernetes/kubernetes/issues/74839): some TCP packets exceed the window limit and are marked by the kernel as having an invalid conntrack state, which caused entire TCP connections to be reset. The Kubernetes community introduced this rule to solve that problem, but it can break scenarios like this one where the forward and return paths differ. Related community reports include [#Issue 117924](https://github.com/kubernetes/kubernetes/issues/117924), [#Issue 94861](https://github.com/kubernetes/kubernetes/issues/94861) and [#Issue 177](https://github.com/spidernet-io/cni-plugins/issues/177).

We worked with the community to fix this problem, and it was finally resolved by [only drop invalid cstate packets if non liberal](https://github.com/kubernetes/kubernetes/pull/120412) in Kubernetes v1.29. You need to set the sysctl parameter `net.netfilter.nf_conntrack_tcp_be_liberal=1` on every node and restart kube-proxy; kube-proxy then no longer installs this drop rule, so communication between a single-Macvlan Pod and a single-Calico Pod is not affected. A minimal sketch of these steps is shown below.
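The sketch assumes kube-proxy is deployed as a DaemonSet named `kube-proxy` in the `kube-system` namespace, which is the common default; adapt it to your deployment:

```shell
# On every node: tolerate out-of-window TCP packets instead of marking them INVALID.
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
# Optionally persist the setting across reboots.
echo 'net.netfilter.nf_conntrack_tcp_be_liberal = 1' > /etc/sysctl.d/90-conntrack-liberal.conf

# Restart kube-proxy so it picks up the sysctl and stops installing the drop rule.
kubectl -n kube-system rollout restart daemonset kube-proxy
```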

After doing so, check whether the drop rule still exists on the node. If the command below prints nothing, everything is fine; otherwise, verify that the sysctl is set correctly and that kube-proxy has been restarted.

~# iptables-save -t filter | grep -- '--ctstate INVALID -j DROP'

Note: make sure your Kubernetes version is v1.29 or later. If it is older than v1.29, this rule will affect traffic between Macvlan Pods and Calico Pods.

### Macvlan Pod accessing a Calico Pod's Service

As shown by the `macvlan access to calico pod/service` path in the figure above:

1. Spiderpool injects a route like the one below into the Macvlan Pod so that packets destined for Services are forwarded to the node through the veth0 NIC. Below, 10.233.0.0/18 is the Service subnet; when the Macvlan Pod accesses a Service, the packet is forwarded to the node via veth0:

~# ip r
10.233.0.0/18 via 10.7.168.71 dev veth0 src 10.7.200.2

2. In the host network stack, kube-proxy translates the destination address to the Calico Pod's IP, and the packet is then forwarded to the target host through the tunnel route set up by Calico. Note that when the packet leaves the source host, its source address is SNATed to the source host's IP, which ensures that the target host can send the reply back along the same path, avoiding the asymmetric forward/return paths of the previous scenario.
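To observe the SNAT described above (a sketch; it assumes kube-proxy runs in iptables mode, where masquerading is performed in the `KUBE-POSTROUTING` chain of the nat table), you can inspect the source node:

```shell
# Masquerade rules that rewrite the source address of service traffic leaving the node.
~# iptables-save -t nat | grep -E 'KUBE-POSTROUTING|MASQUERADE' | head
```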

## Conclusion

The table below summarizes the communication scenarios when these three types of Pods coexist in one cluster:

| Source \ Target | Calico Pod | Macvlan Pod | Calico + Macvlan multi-NIC Pod | Service backed by Calico Pods | Service backed by Macvlan Pods | Service backed by Calico + Macvlan multi-NIC Pods |
|-|-|-|-|-|-|-|
| Calico Pod | | | | | | |
| Macvlan Pod | affected by kube-proxy's drop rule | | | | | |
| Calico + Macvlan multi-NIC Pod | | | | | | |
102 changes: 102 additions & 0 deletions docs/usage/multi_cni_coexist.md
@@ -0,0 +1,102 @@
# Multi-CNI Coexistence in a Cluster

## Background

CNIs are important components of a Kubernetes cluster. Typically, a single CNI (e.g. Calico) is deployed and is responsible for cluster network connectivity. In some cases, users run multiple CNIs in one cluster for performance, security, or other reasons, for example an Underlay CNI such as Macvlan. A cluster may then contain Pods backed by several CNI types, each suited to a different scenario:

* Pods with a single Calico NIC: system components such as CoreDNS, which need neither a fixed IP nor north-south connectivity, only east-west communication within the cluster.
* Pods with a single Macvlan NIC: applications with special performance or security requirements, or traditional applications moving to the cloud that need to expose the Pod IP directly for north-south traffic.
* Multi-NIC Pods with both Calico and Macvlan NICs: applications that need both of the above, i.e. a fixed Pod IP for north-south traffic as well as east-west communication with the cluster (e.g. with Calico Pods or Services).

When these three types of Pods exist in one cluster, the cluster effectively runs two different data forwarding schemes: Underlay and Overlay. This can lead to a number of other problems:

* Pods using the Underlay network cannot directly access the cluster's east-west traffic such as ClusterIP Services.
* Pods using the Underlay network cannot communicate directly with Pods using the Overlay network: because the forwarding paths differ, Overlay traffic usually needs a second hop through the node, whereas Underlay traffic is generally forwarded directly by the underlying gateway. When the two access each other, packets may therefore be dropped because the underlying switch has not synchronized routes for the cluster subnet.
* Using two network modes in one cluster may increase the complexity of use and operation, such as IP address management.

Spiderpool is a complete Underlay network solution that solves the interoperability problem when there are multiple CNIs in a cluster and reduces the IP address operation and maintenance burden. The following section describes the data forwarding process between them.

## Quick start

* For Calico + Macvlan multi-NIC Pods, see [get-started-calico](./install/overlay/get-started-calico.md).
* For a single Macvlan NIC, see [get-started-macvlan](./install/underlay/get-started-macvlan.md).
* For Underlay CNI access to Services, see [underlay_cni_service](./underlay_cni_service.md).

## Data forwarding process

![dataplane](../images/underlay_overlay_cni.png)

Several typical communication scenarios are described below.

### Calico Pod Accessing Calico and Macvlan Multi-NIC Pods

As shown in the `calico access to calico-macvlan pod` access path in the above figure.

1. The packet is first forwarded from the Calico Pod (10.233.100.1) to the node via its calixxx virtual NIC, and then routed between nodes to the target host.
2. Whether the packet targets the Pod's Calico NIC (10.233.100.2) or its Macvlan NIC IP (10.7.200.1), it is forwarded into the target Pod through the Pod's corresponding calixxx virtual NIC.

Due to the limitation of Macvlan bridge mode, the master (parent) interface and its sub-interfaces cannot communicate directly, so the node cannot reach the Pod's Macvlan IP directly. Spiderpool therefore injects a route on the node that forwards traffic for the Pod's Macvlan NIC through the calixxx interface, so that the Macvlan parent and sub-interfaces can communicate.

3. When the calico-macvlan Pod sends its reply, the destination Pod IP is 10.233.100.1. Since 10.233.64.0/18 is the calico subnet, all traffic destined for the calico subnet is forwarded from eth0 to the host:

~# kubectl exec -it calico-macvlan-556dddfdb-4ctxv -- ip r
10.233.64.0/18 via 10.7.168.71 dev eth0 src 10.233.100.2

4. Since the destination IP of the reply is 10.233.100.1, it matches the tunnel route of the calico subnet and is forwarded to the target node. Finally, it is delivered to the target Pod through the target Pod's calixxx virtual NIC, completing the exchange.
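On the node hosting the multi-NIC Pod, a quick sanity check (a sketch; the actual interface name and output format depend on your environment) is to confirm that the Macvlan IP is routed through the Pod's calixxx interface:

```shell
# The Pod's Macvlan IP (10.7.200.1 in the figure) should be reachable via the
# calixxx veth route injected by Spiderpool, not via the physical network.
~# ip route | grep 10.7.200.1
10.7.200.1 dev calixxx scope link
```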

### Calico + Macvlan Multi-NIC Pod Accessing a Calico Pod's Service

As shown in the `access to calico pod's service` access path in the above figure:

1. Spiderpool injects a route into the Pod so that packets accessing Services are forwarded from eth0. The packets therefore first match the ServiceCIDR route inside the Pod and are forwarded from eth0 to the node. Below, 10.233.0.0/18 is the Service subnet:

~# kubectl exec -it calico-macvlan-556dddfdb-4ctxv -- ip r
10.233.0.0/18 via 10.7.168.71 dev eth0 src 10.233.100.2

2. kube-proxy in the host network stack resolves the target service address to the calico Pod's IP, 10.233.100.1. The packet is then forwarded to the target node through inter-node routing, and finally into the target Pod through the target Pod's calixxx virtual NIC.
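A simple end-to-end check of this path (a sketch; it assumes the Pod image provides `nslookup` and that cluster DNS is reachable through a ClusterIP Service) is to resolve a Service name from inside the multi-NIC Pod:

```shell
# Name resolution goes to the cluster DNS ClusterIP, i.e. out of eth0 and
# through kube-proxy on the node, exercising the path described above.
~# kubectl exec -it calico-macvlan-556dddfdb-4ctxv -- nslookup kubernetes.default.svc.cluster.local
```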

### Macvlan Pod access to Calico Pod

As shown in `macvlan access to calico pod` above:

1. Spiderpool injects a route into the Pod that forwards traffic for the calico subnet through veth0. The following 10.233.64.0/18 is the calico subnet, and this route ensures that the Pod forwards traffic destined for the calico subnet to the host via veth0:

~# kubectl exec -it macvlan-76c49c7bfb-82fnt -- ip r
10.233.64.0/18 via 10.7.168.71 dev veth0 src 10.7.200.2

2. After forwarding to the host, since the destination address is 10.233.100.1, the packet is forwarded to the target host through calico's tunnel route, and then forwarded to the target Pod through the calixxx virtual NIC corresponding to the target Pod.
3. However, when the target Calico Pod sends its reply, the destination IP is 10.7.200.2, so the packet is forwarded directly to the target Pod without going through the node. Because the forward and return paths differ, the kernel may consider the packet's conntrack state invalid, and the packet is then dropped by one of kube-proxy's iptables rules:

~# iptables-save -t filter | grep -- '--ctstate INVALID -j DROP'
-A FORWARD -m conntrack --ctstate INVALID -j DROP

This rule was originally added to address [#Issue 74839](https://github.com/kubernetes/kubernetes/issues/74839): some TCP packets exceed the window size limit and are marked by the kernel as having an invalid conntrack state, which caused entire TCP connections to be reset. The Kubernetes community introduced this rule to solve that problem, but it can break scenarios like this one where the forward and return paths differ. Related community reports include [#Issue 117924](https://github.com/kubernetes/kubernetes/issues/117924), [#Issue 94861](https://github.com/kubernetes/kubernetes/issues/94861) and [#Issue 177](https://github.com/spidernet-io/cni-plugins/issues/177).

We pushed the community to fix this issue, which was finally resolved by [only drop invalid cstate packets if non liberal](https://github.com/kubernetes/kubernetes/pull/120412) in Kubernetes v1.29. We need to set the sysctl parameter `net.netfilter.nf_conntrack_tcp_be_liberal=1` on each node and restart kube-proxy, so that kube-proxy no longer installs this drop rule and communication between a single Macvlan Pod and a single Calico Pod is not affected. A minimal sketch of these steps is shown below.
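The sketch assumes kube-proxy is deployed as a DaemonSet named `kube-proxy` in the `kube-system` namespace, which is the common default; adapt it to your deployment:

```shell
# On every node: tolerate out-of-window TCP packets instead of marking them INVALID.
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
# Optionally persist the setting across reboots.
echo 'net.netfilter.nf_conntrack_tcp_be_liberal = 1' > /etc/sysctl.d/90-conntrack-liberal.conf

# Restart kube-proxy so it picks up the sysctl and stops installing the drop rule.
kubectl -n kube-system rollout restart daemonset kube-proxy
```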

After doing so, check whether the drop rule still exists on the node. If the command below prints nothing, everything is fine; otherwise, verify that sysctl is set correctly and that kube-proxy has been restarted.

~# iptables-save -t filter | grep -- '--ctstate INVALID -j DROP'

Note: You must make sure that the Kubernetes version is v1.29 or later. If your Kubernetes version is older than v1.29, this rule will affect access between Macvlan Pods and Calico Pods.

### Macvlan Pod access to Calico Pod's Service

Refer to the `macvlan access to calico pod/service` diagram above.

1. Spiderpool injects a route like the one below into the Macvlan Pod so that, when the Pod accesses a Service, the packets are forwarded to the node via the veth0 NIC. Below, 10.233.0.0/18 is the Service subnet:

~# ip r
10.233.0.0/18 via 10.7.168.71 dev veth0 src 10.7.200.2

2. After passing through the host's network stack, kube-proxy converts the destination address to the Calico Pod's IP, which is then forwarded to the target host through the tunnel route set by Calico. Note that when the packet is sent from the source host, its source address is SNATed to the IP of the source host, so this ensures that when the target host receives the packet, it can return the packet the way it came in without the inconsistent round-trip paths of the previous scenario.
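To observe the SNAT described above (a sketch; it assumes kube-proxy runs in iptables mode, where masquerading is performed in the `KUBE-POSTROUTING` chain of the nat table), you can inspect the source node:

```shell
# Masquerade rules that rewrite the source address of service traffic leaving the node.
~# iptables-save -t nat | grep -E 'KUBE-POSTROUTING|MASQUERADE' | head
```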

## Conclusion

We have summarized some communication scenarios when these three types of Pods exist in a cluster as follows.

| Source\Target | Calico Pod | Macvlan Pod | Calico + Macvlan Multi-NIC Pod | Service for Calico Pod | Service for Macvlan Pod | Service for Calico + Macvlan Multi-NIC Pod |
|- |- |- |- |- |- |-|
| Calico Pod |||||||
| Macvlan Pod | affected by kube-proxy's drop rule ||||||
| Calico + Macvlan Multi NIC Pod |||||||
