
Egress Policy #667

Closed
jianjuns opened this issue Apr 29, 2020 · 20 comments
Labels: lifecycle/stale, proposal

@jianjuns (Contributor)

Describe what you are trying to solve
Implementation of egress and SNAT policies in Antrea: being able to control the egress Nodes and SNAT IPs of Pod egress traffic (traffic from Pods to the external network).

Describe the solution you have in mind
Here are some high-level ideas.

Egress policy definition

EgressPolicy CRD
We might introduce an EgressPolicy CRD that (a rough sketch follows this list):

  • selects the Pods, Namespaces, or Services to which the policy applies
    We might prioritize selecting a single Namespace or Service.
  • defines the SNAT strategy, e.g.:
    • using a specified IP
    • allocating a dedicated IP from an IP pool
  • potentially supports other egress policies, e.g.:
    • egressing from a specified Node
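For illustration only, here is a rough sketch of what the Go API types for such a CRD could look like. All names (EgressPolicy, EgressIP, IPPool, NodePool, and so on) are hypothetical and not a settled API; the sketch only encodes the selection and SNAT-strategy ideas above.

    // Hypothetical Go API types for an EgressPolicy CRD (illustration only,
    // none of these names are final).
    package v1alpha1

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // EgressPolicy selects Pods and defines how their egress traffic is SNAT'd.
    type EgressPolicy struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`
        Spec              EgressPolicySpec `json:"spec"`
    }

    type EgressPolicySpec struct {
        // Selectors for the Pods/Namespaces the policy applies to.
        PodSelector       *metav1.LabelSelector `json:"podSelector,omitempty"`
        NamespaceSelector *metav1.LabelSelector `json:"namespaceSelector,omitempty"`

        // SNAT strategy: exactly one of EgressIP or IPPool would be set.
        // EgressIP is a specific SNAT IP to use for the selected Pods.
        EgressIP string `json:"egressIP,omitempty"`
        // IPPool names an IPPool CRD from which a dedicated SNAT IP is allocated.
        IPPool string `json:"ipPool,omitempty"`

        // NodePool optionally names a NodePool CRD that restricts which Nodes
        // can act as egress Nodes for this policy.
        NodePool string `json:"nodePool,omitempty"`
    }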

IPPool CRD
We could add a CRD to define an IP pool. Besides SNAT, it could be used for Pod IPAM too.

NodePool CRD
We could add a CRD to define the set of Nodes that can act as egress Nodes. For simplicity, we might start with a single egress Node pool.
A Node can have multiple interfaces, so we might need to support configuring which interface to use for egress.
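For illustration only, rough sketches of the IPPool and NodePool CRDs described in the two sections above (again, all names and fields are hypothetical):

    // Hypothetical Go API types for the IPPool and NodePool CRDs (illustration only).
    package v1alpha1

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // IPPool defines a pool of IPs that can be allocated as SNAT IPs (and
    // potentially reused for Pod IPAM).
    type IPPool struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`
        Spec              IPPoolSpec `json:"spec"`
    }

    type IPPoolSpec struct {
        // IPRanges lists the IP ranges in the pool, as CIDRs or start/end pairs.
        IPRanges []IPRange `json:"ipRanges"`
    }

    type IPRange struct {
        CIDR  string `json:"cidr,omitempty"`
        Start string `json:"start,omitempty"`
        End   string `json:"end,omitempty"`
    }

    // NodePool defines the set of Nodes that can act as egress Nodes.
    type NodePool struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`
        Spec              NodePoolSpec `json:"spec"`
    }

    type NodePoolSpec struct {
        // NodeSelector selects the member Nodes.
        NodeSelector metav1.LabelSelector `json:"nodeSelector"`
        // EgressInterface optionally names the interface to use for egress on
        // the selected Nodes, for Nodes with multiple interfaces.
        EgressInterface string `json:"egressInterface,omitempty"`
    }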

Egress IP management

Discovery of Node IPs
Antrea Agent auto-discovers all interfaces and their IPs, and probably saves the information to a CRD like NodeInfo.
Then the user can use any of the discovered IPs to define an EgressPolicy.

Auto IP assignment by Controller
Antrea Controller can automatically assign a SNAT IP to a Node from a configured NodePool.

HA and failover
When the SNAT IP is assigned to Nodes by the Controller, we might further support failover of the SNAT IP, i.e. moving the SNAT IP to a new Node when the current Node fails. There are two possible approaches:

Decision by Controller
When the Node fails, Controller should move the IP to another available Node.

To avoid conflicts between the old and new egress Nodes (e.g. the old Node loses its connection to the K8s API or the Controller, but is still active and can serve egress traffic in the datapath), we might introduce some conflict detection mechanism. For example, the new Node tries to ping the old Node (at the SNAT IP) to see if it is still active and reachable.

Limitations:

  • If the IP is assigned by the Controller, the SNAT IP cannot fail over to another Node when the Controller or the K8s API is down.
  • In SNAT IP failover, existing connections will be broken, as we do not replicate connection state.

Active/standby Nodes
The Controller selects a pair of active/standby Nodes for each SNAT IP. The active/standby Nodes use a distributed protocol to decide the active Node, and may even replicate connection state.
One possible solution in Linux is to leverage conntrackd and keepalived.

If we can assume every SNAT IP is reachable from every Node, the implementation could be simpler: a source Node does not need to know which Node a SNAT IP is currently on, and can just tunnel/route the packets to the SNAT IP. If this is not the case (for example, the SNAT IPs are in a separate network from the Node network, and are assigned to extra NICs of a specific set of Nodes acting as egress Nodes), we need some way to notify all Nodes about the current active egress Node for a SNAT IP: either through the Controller or the K8s API, which again requires the Controller or the K8s API to be available during failover, or through another distributed protocol.

Data path design

The source Node forwards the egress packets to the egress Node (through a tunnel, or via routing in noEncap or hybrid mode), and the egress Node SNATs the packets with the assigned SNAT IP.

jianjuns added the "proposal" label on Apr 29, 2020
@rangar commented May 20, 2020

Have you considered multiple interfaces/networks that Nodes can be connected to? Will I be able to pick which interface/network to SNAT from?

@jianjuns (Contributor, Author)

Have you considered multiple interfaces/networks that Nodes can be connected to? Will I be able to pick which interface/network to SNAT from?

Yes, I mentioned multiple interfaces in the "NodePool CRD" section. But as with the other ideas described in the proposal, I have no detailed design yet, and we will need to look into the details.

@github-actions (bot)

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

github-actions bot added the "lifecycle/stale" label on Nov 18, 2020
tnqn removed the "lifecycle/stale" label on Jan 25, 2021
@tnqn (Member) commented Jan 27, 2021

@jianjuns I'm working on this and have some questions I want to discuss with you:

  1. About the multiple interfaces/networks a Node can connect to, is it a valid use case to consider? I thought the egress policy applies to all external traffic, so the interface with the default route should always be the egress interface? If there are multiple networks, then the policy should have something like a DestinationCIDR? We would have to configure routes if they don't match the user-configured policy (destinationCIDR => interface / egress IP).
  2. Since users can use a PodSelector and NamespaceSelector to select Pods, a Pod may be selected by multiple policies, and it seems difficult to prevent this as a Pod could be created after the policies. What do we do in this case? Is randomly SNATing to one of them acceptable, given this shouldn't be a valid use case? Another way is to make the EgressPolicy map 1:1 to a Namespace, but I guess you wouldn't want to sacrifice the flexibility for this unusual case.
  3. Is the NodePool CRD needed? Do you think we could just use a Node labelSelector to make the configuration easier? Typically, the user can label certain Nodes as egress Nodes and select them via a NodeSelector in the policy.

@jianjuns (Contributor, Author)

  1. I am thinking about the case where you have a subset of Nodes for egress, which have extra NICs on a different physical subnet. If we allocate SNAT IPs to Nodes, then it seems we might need to configure routes too; but for the 1st version, if we just assume the IPs are configured manually by users, then we can assume the routes are correctly configured too.
  2. This is a valid point. Unless we introduce priority, it seems we can only randomly select an IP. Another choice is to fall back to the Service and Namespace annotations, but then, as you said, we lose the flexibility.
    There is an upstream proposal which also uses a label selector: sig-network: Add egress-source-ip-support KEP kubernetes/enhancements#1105
  3. You mean to select Nodes in the SNAT policy CRD? Basically I am trying to separate IP management from SNAT, so SNAT policies can be independent of IP management. And we should probably consider making the IP management part generic, so it can be shared by other features like L4 LB (assuming we might implement LB type Services too).

@tnqn (Member) commented Jan 28, 2021

  1. I mean the Nodes that can access the external network should have a default route on one of their NICs, right? So we don't need to care how many NICs they have, and could always assign the SNAT IP to the NIC with the default route? If this is a reasonable assumption, we could configure the IP automatically instead of asking the user to do it.
  2. Then maybe let's first use a labelSelector and assume overlapping policies are not the normal case.
  3. Yes, I mean how the user selects the Nodes for a policy. I saw you proposed a NodePool CRD and wondered if we could just use a labelSelector. For example, the user could label certain Nodes with "egress: true", then set the egress policy's nodeSelector to egress=true. I feel it might be easier to use and implement.

@jianjuns (Contributor, Author)

  1. Right, for now we might assume there is a default route. Later we can consider using different SNAT IPs for different destinations, in which case a default route is not necessarily required. But I would still do IP configuration later (in my mind we need to support automatic IP->Node assignment together with it).
  2. Ok.
  3. In your proposal, the user needs to duplicate the label selector for every SNAT policy? And how would we define that for Services when we support L4 LB? I would either have a separate (group) CRD to select Nodes, or select Nodes in the IPPool CRD. But we can decide that when we support automatic IP assignment?

@jianjuns (Contributor, Author)

In the first version, we require users to manually configure SNAT IPs on the Nodes. In a SNATPolicy, a particular SNAT IP can be specified for the selected Pods, and antrea-controller will publish the SNATPolicy to the Nodes on which the selected Pods run.
On the Node, antrea-agent will realize the SNATPolicy with OVS flows and iptables rules. If the SNAT IP is not present on the local Node, the packets to be SNAT'd will be tunneled to the SNAT Node, using the SNAT IP as the tunnel destination IP. On the SNAT Node, the tunnel destination IP will be directly used as the SNAT IP.
On the SNAT Node, an iptables rule will be added to perform the SNAT with the specified SNAT IP, but which SNAT IP to use for a given packet is controlled by the OVS flows: the flows mark a packet that needs to be SNAT'd with the integer ID allocated for its SNAT IP, and the corresponding iptables SNAT rule matches the packet mark.

The OVS flow changes include:

table 31
// SNAT flows for Windows
- priority=210 ip,-new+trk,snatCTMARK,from_uplink macRewriteMark,goto:40 (SNAT return traffic)
+ priority=210 ip,-new+trk,snatCTMARK,from_uplink,nw_dst=localSubnet macRewriteMark,goto:40 (SNAT return traffic - remote packets will be handled by L3Fwd flows, so no need to set the macRewrite MAC)

table 70
// Reuse these Windows SNAT flows to skip packets that do not need SNAT
+priority=200 ip,from_local,nw_dst=localSubnet goto:80
+priority=200 ip,from_local,nw_dst=nodeIP goto:80
+priority=200 ip,from_local,nw_dst=gatewayCTMark goto:80

// Send packets for external network to the SNAT table
+priority=190 ip,from_local goto:71
+priority=190 ip,macRewriteMark mod_dl_dst:gw0_mac,goto:71 (traffic tunneled from remote Nodes)

+table 71 (snatTable. ttlDecTable is moved to table 72)
// Windows flows: load SNAT IP to a register (probably share the endpointIPReg and endpointIPv6XXReg)
priority=200 ip,+new+trk,in_port=local_pods snatRequiredMark(snat_ip),goto:80 (SNAT for local Pods, matching in_ports)
priority=200 ip,+new+trk,tun_dst=snat_ip snatRequiredMark(tun_dst),goto:80 (SNAT for remote Pods, matching tun_dst)
priority=190 ip,+new+trk snatRequiredMark(node_ip),goto:80 (default SNAT IP)

// Linux: mark the packet with an integer ID allocated for each SNAT IP
priority=200 ip,+new+trk,in_port=local_pods mark(snat_id),goto:80 (SNAT for local Pods)
priority=200 ip,+new+trk,tun_dst=snat_ip mark(snat_id),goto:80 (SNAT for remote Pods)

// common: tunnel packets that need to be SNAT'd on a remote Node, using the SNAT IP as the outer destination
priority=200 ip,in_port=local_pods mod_dl_src:gw0_mac,mod_dl_dst:vMAC,snat_ip->NXM_NX_TUN_IPV4_DST,goto:72
priority=0 goto_table:80

+table 72 (ttlDecTable)

table 105
// Windows: perform SNAT with the SNAT IP saved in the register
+priority=200 ip,+new+trk,snatRequiredMark ct(commit,table=110,zone=65520,nat(src=snat_ip),snatCTMark)

iptables rules:
iptables -t nat -A POSTROUTING -m mark --mark snat_id -j SNAT --to-source snat_ip
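
For illustration, a minimal sketch of how an agent could install one such rule per SNAT IP, given the integer mark ID allocated for that IP. This is not the actual antrea-agent code; the function name and the example IP/ID are made up.

    // Illustrative only: install one POSTROUTING SNAT rule per (SNAT IP, mark ID)
    // pair, matching the iptables rule shown above.
    package main

    import (
        "fmt"
        "os/exec"
    )

    func installSNATRule(snatIP string, markID uint32) error {
        // Equivalent to:
        //   iptables -t nat -A POSTROUTING -m mark --mark <markID> -j SNAT --to-source <snatIP>
        args := []string{
            "-t", "nat", "-A", "POSTROUTING",
            "-m", "mark", "--mark", fmt.Sprintf("%d", markID),
            "-j", "SNAT", "--to-source", snatIP,
        }
        out, err := exec.Command("iptables", args...).CombinedOutput()
        if err != nil {
            return fmt.Errorf("iptables failed: %v: %s", err, out)
        }
        return nil
    }

    func main() {
        // Example: the SNAT IP 10.10.0.100 was allocated mark ID 1 (hypothetical values).
        if err := installSNATRule("10.10.0.100", 1); err != nil {
            fmt.Println(err)
        }
    }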

@jianjuns (Contributor, Author) commented Feb 19, 2021

@tnqn : let me know what you think ^^
I tested the flows and iptables rules already.

@tnqn (Member) commented Feb 23, 2021

@jianjuns the proposal LGTM; one question about configuring the SNAT IP and publishing the SNATPolicy:
It looks like in the first version the SNATPolicy doesn't have any Node information. How does a Node know it is a SNAT Node, and which IPs are SNAT IPs, if the SNATPolicy is only pushed to the Nodes that run the selected Pods? Or, if it's still required to specify a Node in the SNATPolicy, or there is a NodePool CRD, is it really necessary to ask the user to configure the SNAT IP manually?

@jianjuns (Contributor, Author)

@tnqn : I think each Node can discover all local IPs, and based on that decide whether to perform the SNAT locally or tunnel to the SNAT IP.

Do you have extra thoughts on the 1st version scope, like should we do IPv4 only or not?
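
For illustration, local IP discovery can rely on the Go standard library; a minimal sketch (not the actual agent code, and the SNAT IP below is a made-up example) of deciding between local SNAT and tunneling:

    // Illustrative only: enumerate all IPs configured on the Node's interfaces,
    // so the agent can check whether a given SNAT IP is local (perform SNAT here)
    // or remote (tunnel the packets to the SNAT IP).
    package main

    import (
        "fmt"
        "net"
    )

    func localIPs() (map[string]bool, error) {
        ips := map[string]bool{}
        addrs, err := net.InterfaceAddrs()
        if err != nil {
            return nil, err
        }
        for _, addr := range addrs {
            if ipNet, ok := addr.(*net.IPNet); ok {
                ips[ipNet.IP.String()] = true
            }
        }
        return ips, nil
    }

    func main() {
        ips, err := localIPs()
        if err != nil {
            panic(err)
        }
        snatIP := "10.10.0.100" // hypothetical SNAT IP from a SNATPolicy
        if ips[snatIP] {
            fmt.Println("SNAT IP is local: perform SNAT on this Node")
        } else {
            fmt.Println("SNAT IP is remote: tunnel packets to the SNAT IP")
        }
    }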

@tnqn (Member) commented Feb 24, 2021

I assume the agent will need to treat all local IPs as potential SNAT IPs, and configure OpenFlow entries, allocate mark IDs, and configure iptables rules for them. Could that lead to many unnecessary configurations? For example, when kube-proxy IPVS mode is used, all Service IPs are configured on a network interface; there might be other cases we are not aware of in production.

@jianjuns (Contributor, Author)

Do we have some way to filter IPs and assume we have a reasonable set of IPs to care about?

Another way is to watch all SNAT policies and learn which IPs can be used.

@tnqn (Member) commented Feb 24, 2021

I think the IP filtering approach might not be clean, and could become complex to adapt to all scenarios. Using the SNATPolicy as the source of truth sounds good to me.

Is this the first version scope in your mind:

  1. The user needs to configure SNAT IPs manually and reasonably (no duplicates, no missing IPs)
  2. The user configures a SNATPolicy with a PodSelector and a SNAT IP
  3. No failover if the Node holding the SNAT IP crashes
  4. Encap mode only
  5. Dual-stack?

@jianjuns (Contributor, Author)

Yes, that is what I am thinking. Do you think it would save some work if we start with IPv4 and Linux (there is not much extra work to support Windows too)?

@tnqn (Member) commented Feb 24, 2021

I feel the main work to support IPv6 is mostly testing, as the design doesn't sound address-family specific, while I'm not sure about the extra work to support Windows, given that it doesn't use iptables to do SNAT. I would lean towards supporting dual-stack, and only Linux, in the first version.

@jianjuns (Contributor, Author)

Windows might be even easier, as we do the SNAT with OVS only. But as with IPv6, there will be testing work. If our target is 0.14, I think we can just do IPv4 on Linux.

@tnqn (Member) commented Feb 24, 2021

Sure, IPv4 on Linux sounds good to me.

@github-actions (bot)

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

github-actions bot added the "lifecycle/stale" label on Aug 24, 2021
@antoninbas (Contributor)

I am going to close this. Other more specialized issues can be created to address gaps in the implementation: Windows, noEncap, etc.
