Network Policy Rule Evaluation Blocks Traffic to DNS Server #146

Open
junzebao opened this issue Oct 31, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@junzebao

What happened:
I created a new EKS cluster and enabled network policy in the VPC CNI, but I realized that the following NetworkPolicy blocks DNS resolution requests. It was working when I used Calico as the network plugin.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: curl
  namespace: default
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
          - 10.0.0.0/8
          - 172.16.0.0/12
          - 192.168.0.0/16
  - ports:
      - protocol: UDP
        port: 53
      - port: 53
        protocol: TCP
  policyTypes:
    - Egress
  podSelector:
    matchLabels:
      app: curl

Attach logs

What you expected to happen:
The first rule would block DNS requests to kube-dns (172.20.0.10 in our case), but the second rule should allow the request.

How to reproduce it (as minimally and precisely as possible):
Create a pod with the label app: curl in the default namespace (the same namespace as the NetworkPolicy) on an EKS cluster with network policy enabled in the VPC CNI. I attached this configuration to the EKS addon: { "enableNetworkPolicy": "true" }.
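
For reference, a minimal pod that matches the policy's podSelector could look like this (image and command are just placeholders for testing, not part of the original report):

apiVersion: v1
kind: Pod
metadata:
  name: curl
  namespace: default
  labels:
    app: curl
spec:
  containers:
    - name: curl
      image: curlimages/curl:latest
      # Keep the container running so connectivity can be tested with kubectl exec.
      command: ["sleep", "3600"]

With the NetworkPolicy applied, something like kubectl exec curl -- curl -v https://example.com should fail at the name-resolution step if DNS egress is being blocked.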

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): Client Version: v1.31.1, Server Version: v1.30.5-eks-ce1d5eb
  • CNI Version: v1.18.1-eksbuild.3
  • OS (e.g: cat /etc/os-release): Linux Bottlerocket OS 1.22.0 (aws-k8s-1.30)
  • Kernel (e.g. uname -a):
@junzebao junzebao added the bug Something isn't working label Oct 31, 2024
@orsenthil orsenthil transferred this issue from aws/amazon-vpc-cni-k8s Oct 31, 2024
@orsenthil
Member

The first rule would block DNS requests to kube-dns (172.20.0.10 in our case), but the second rule should allow the request.

If you enable network policy event logs, do you see the toggle from accept to block? Looking at the event logs can shed more light on what is happening with these rule evaluations.
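
A sketch of how those event logs can be turned on through the managed addon configuration (the nodeAgent key mirrors the aws-vpc-cni Helm chart values; verify the exact key names against the addon's configuration schema for your version):

{
  "enableNetworkPolicy": "true",
  "nodeAgent": {
    "enablePolicyEventLogs": "true"
  }
}

Once enabled, the per-flow ACCEPT/DENY decisions should appear in the node agent's log on the worker node (typically /var/log/aws-routed-eni/network-policy-agent.log), which makes it easier to see which rule is matched for the DNS traffic.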

@Pavani-Panakanti

The network policy controller creates PolicyEndpoint rules such that, when rules conflict, DENY takes precedence over ALLOW. Based on the upstream recommendation, we should ALLOW in such cases. This needs a change in the Network Policy Controller. We are looking into it and will post an update on the fix here.

Policy endpoint generated for the above network policy

dev-dsk-pavanipt-2a-0981017d % kubectl describe policyendpoint curl-4mhvr
Name:         curl-4mhvr
Namespace:    test1
Labels:       <none>
Annotations:  <none>
API Version:  networking.k8s.aws/v1alpha1
Kind:         PolicyEndpoint
Metadata:
  Creation Timestamp:  2024-11-12T23:06:46Z
  Generate Name:       curl-
  Generation:          1
  Owner References:
    API Version:           networking.k8s.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  NetworkPolicy
    Name:                  curl
    UID:                   acc0d91d-e6ab-46fe-a8bb-aad64e6eaaec
  Resource Version:        1073029
  UID:                     7415e7fe-6711-4df3-8664-ba5ae73d71a1
Spec:
  Egress:
    Cidr:  ::/0
    Ports:
      Port:      53
      Protocol:  UDP
      Port:      53
      Protocol:  TCP
    Cidr:        0.0.0.0/0
    Except:
      10.0.0.0/8
      172.16.0.0/12
      192.168.0.0/16
    Ports:
      Port:      53
      Protocol:  UDP
      Port:      53
      Protocol:  TCP
  Pod Isolation:
    Egress
  Pod Selector:
    Match Labels:
      App:  tester2
  Pod Selector Endpoints:
    Host IP:    192.168.21.1
    Name:       tester2-55c7c875f-xx96w
    Namespace:  test1
    Pod IP:     192.168.15.193
  Policy Ref:
    Name:       curl
    Namespace:  test1
Events:         <none>

For your specific case, providing an explicit CIDR ("10.0.0.0/8") in the second rule should fix the issue. Let us know if this works for you as a workaround.
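
A sketch of that workaround applied to the policy above, with explicit destination CIDRs added to the DNS rule (adjust the blocks so they cover your cluster's service CIDR; 172.20.0.10 falls inside 172.16.0.0/12):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: curl
  namespace: default
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
          - 10.0.0.0/8
          - 172.16.0.0/12
          - 192.168.0.0/16
  - to:
    - ipBlock:
        cidr: 10.0.0.0/8
    - ipBlock:
        cidr: 172.16.0.0/12   # covers kube-dns at 172.20.0.10
    - ipBlock:
        cidr: 192.168.0.0/16
    ports:
      - protocol: UDP
        port: 53
      - protocol: TCP
        port: 53
  policyTypes:
    - Egress
  podSelector:
    matchLabels:
      app: curl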

@junzebao
Author

junzebao commented Nov 29, 2024

Thanks for the clarification @Pavani-Panakanti. The PolicyEndpoint does make it easier for me to troubleshoot. It really confused me that our network policies behave differently after switching from Calico to the AWS VPC CNI.

I've got another issue: a cronjob that pushes metrics to the Prometheus Pushgateway within the same cluster fails sporadically due to network policies, but it sometimes succeeds on retry. After removing the network policy, the job always succeeds. Do you have any hints why?

---- Update below with my findings ----

I have a cronjob CJ1 that needs to access a service SVC1, and a NetworkPolicy that uses a podSelector to target the pods behind SVC1, with an ingress rule that allows traffic from CJ1 pods. However, I noticed that when a CJ1 pod comes up, the PolicyEndpoint doesn't get updated with the new CJ1 pod IP until about 8 seconds in. The CJ1 pod queries SVC1 directly after startup, so it always fails. It succeeds when it happens to be allocated the same IP as a previously failed pod, because the rule is already present. (The failed pod is in an error state and not yet deleted from the cluster, although I thought an IP occupied by an errored pod shouldn't be allocated to other pods?)

I tried increasing the CPU requests for the nodeAgent in aws-node, but it didn't help. Is there a performance issue here?

@Pavani-Panakanti

Pavani-Panakanti commented Dec 3, 2024

@junzebao It is expected to take 1-2 seconds for the new pod to be reconciled and updated on the SVC1 side; an 8-second delay should not be happening. Can you send the logs from both nodes (egress and ingress) where you saw the 8-second delay in reconciling the new pod IP to k8s-awscni-triage@amazon.com? That would be the node running the CJ1 cronjob pod for egress and the node running SVC1 for ingress, where traffic was denied for 8 seconds.

@junzebao
Author

junzebao commented Dec 4, 2024

After I enabled ANNOTATE_POD_IP, the delay dropped to 1-2 seconds, but that still doesn't solve my problem: I have a cronjob that reaches out to a Kubernetes service immediately after it starts. Right now I can only use an initContainer to check whether the connection is open (see the sketch at the end of this comment); that only adds about 2 seconds.

Is there any way to solve this, or is this the expected behavior that we have to work around?
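
A minimal sketch of that initContainer workaround (the CronJob name, schedule, service name svc1, and port are placeholders; it assumes an image that ships both curl and a shell):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cj1
  namespace: default
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          initContainers:
          - name: wait-for-svc1
            image: curlimages/curl:latest
            # Retry until the service accepts the connection, i.e. until this
            # pod's IP has been reconciled into the relevant PolicyEndpoint rules.
            command:
            - sh
            - -c
            - "until curl -s -o /dev/null http://svc1.default.svc.cluster.local:80/; do sleep 1; done"
          containers:
          - name: job
            image: curlimages/curl:latest
            # Placeholder for the real workload; here it just calls the service once.
            command: ["curl", "-sS", "http://svc1.default.svc.cluster.local:80/"]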

@Pavani-Panakanti

@junzebao This is expected behavior. Once a new pod is created, it takes a short time for its IP to be propagated to all network policies and reach an eventually consistent state; from there on it should work without issues. Services are expected to handle this by retrying.
A few other suggestions to handle this (a sketch of the first option follows the list):

  • Instead of using a pod selector for ingress in your cronjob policy, you can use the IP CIDR of the pods behind SVC1, if that's possible in your case. That way, when a new pod comes up in SVC1, the cronjob NetworkPolicy does not need to be updated every time with the new pod IP.
  • You can also add an init sleep of about 2 seconds in your service SVC1 pods. By the time the pod starts to send traffic, the pod IP will have been updated for ingress in your cronjob policy.
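
A sketch of the first suggestion, with the ingress rule on the SVC1-side policy expressed as an ipBlock instead of a podSelector (policy name, labels, CIDR, and port are placeholders; the CIDR should be the pod subnet the CJ1 pods are scheduled into):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: svc1-allow-cj1
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: svc1               # placeholder: label on the pods behind SVC1
  policyTypes:
    - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 192.168.0.0/16  # placeholder: pod subnet the CJ1 pods run in
    ports:
      - protocol: TCP
        port: 80              # placeholder: SVC1 target port

Since the rule no longer references individual pod IPs, the PolicyEndpoint does not need to be updated when a new CJ1 pod starts.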

@Pavani-Panakanti Pavani-Panakanti transferred this issue from aws/aws-network-policy-agent Dec 5, 2024
@m00lecule

@Pavani-Panakanti is this issue related to aws/aws-network-policy-agent#345?
