CIS scan 1.4 does not work on a multi node cluster #27652

Closed
sowmyav27 opened this issue Jun 19, 2020 · 9 comments
Labels: kind/bug-qa (Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement), status/blocker

@sowmyav27 (Contributor) commented Jun 19, 2020

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

  • Deploy a cluster with 1 etcd, 1 control plane, and 2 worker nodes
  • Kubernetes version can be 1.18.4, 1.17.7, or 1.16.11
  • When the cluster is up and Active, run a CIS 1.4 Permissive scan on the cluster
  • The scan report gets stuck in the "Running" state.

Expected Result:
The scan should finish and the report should be generated successfully.

Other details that may be helpful:

  • On a 4-node 1.15.12-rancher2-3 cluster (1 etcd, 1 control plane, and 2 workers), the scan runs fine.
  • On a single-node (all roles) 1.18.4 cluster, the scan runs fine.

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.4.5-rc8
  • Installation option (single install/HA): single

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): custom
  • Kubernetes version (use kubectl version): 1.18.4/1.17.7/1.16.11

gz#13130
gz#13356

@sowmyav27 sowmyav27 added kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement status/blocker labels Jun 19, 2020
@sowmyav27 sowmyav27 added this to the v2.4.5 milestone Jun 19, 2020
@maggieliu maggieliu modified the milestones: v2.4.5, v2.4.6 Jun 19, 2020
@prachidamle (Member)

Analysis of the problem so far with @leodotcloud:

  • CIS scans are failing because DNS name resolution does not work when the sonobuoy containers run with hostNetwork (net: host) and dnsPolicy: ClusterFirstWithHostNet.
  • So it's not a problem with just CIS; it's a problem with a basic k8s scenario in the new versions 1.18.4/1.17.7/1.16.11.
  • This problem can be recreated by launching any deployment with hostNetwork and dnsPolicy: ClusterFirstWithHostNet on a cluster with these k8s versions.
  • CIS scans run fine on a cluster with 1.18.3/1.17.6/1.16.10 and a similar node/role setup.
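The recreation step above can be sketched with a minimal manifest (the name, image, and command below are illustrative, not the actual sonobuoy spec); on the affected k8s versions, DNS lookups from such a pod time out on nodes that do not host a kube-dns/CoreDNS replica:

```yaml
# Minimal sketch of a deployment exercising the failing scenario:
# host networking combined with the ClusterFirstWithHostNet DNS policy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hostnet-dns-test            # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hostnet-dns-test
  template:
    metadata:
      labels:
        app: hostnet-dns-test
    spec:
      hostNetwork: true                   # the "net:host" condition
      dnsPolicy: ClusterFirstWithHostNet  # the DNS policy in question
      containers:
      - name: test
        image: busybox:1.31               # any image with nslookup works
        command: ["sh", "-c", "nslookup kubernetes.default; sleep 3600"]
```

Apply it with kubectl apply -f and check the pod logs to see whether cluster DNS resolution succeeds from the host network.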

@maggieliu commented Jun 25, 2020

This should be resolved by this upstream PR: kubernetes/kubernetes#92354. Waiting for the next k8s patch release.

@sowmyav27 (Contributor, Author) commented Jul 10, 2020

This issue is still reproducible with k8s - 1.18.5, 1.17.8, 1.16.12.

Note:
Canal and Flannel --> CIS scan does NOT work. Weave and Calico --> CIS scan works.

@Oats87 (Contributor) commented Jul 10, 2020

This issue is not caused by kubernetes/kubernetes#92354.

UDP DNS resolution from the host network to a non-local node (where kube-dns runs on 2 of 3 nodes and you try to resolve via the service IP from the third) does not work:

root@ip-172-31-13-169:~# dig rancher.com @10.43.0.10
; <<>> DiG 9.11.3-1ubuntu1.9-Ubuntu <<>> rancher.com @10.43.0.10
;; global options: +cmd
;; connection timed out; no servers could be reached

What does work is TCP resolution:

root@ip-172-31-13-169:~# dig +tcp rancher.com @10.43.0.10

; <<>> DiG 9.11.3-1ubuntu1.9-Ubuntu <<>> +tcp rancher.com @10.43.0.10
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59809
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 46f7bf0b2f6623f9 (echoed)
;; QUESTION SECTION:
;rancher.com.			IN	A

;; ANSWER SECTION:
rancher.com.		30	IN	A	104.26.4.146
rancher.com.		30	IN	A	172.67.71.14
rancher.com.		30	IN	A	104.26.5.146

;; Query time: 37 msec
;; SERVER: 10.43.0.10#53(10.43.0.10)
;; WHEN: Fri Jul 10 04:51:09 UTC 2020
;; MSG SIZE  rcvd: 133

@prachidamle (Member)

This upstream issue seems very relevant: kubernetes/kubernetes#87852

@Oats87 (Contributor) commented Jul 10, 2020

Reverting the same cluster to v1.18.3-rancher2-2 made remote UDP 53 DNS resolution work.

@Oats87 (Contributor) commented Jul 10, 2020

Running ethtool --offload flannel.1 rx off tx off made DNS resolution start working:

root@ip-172-31-13-169:~# dig @10.43.0.10 google.com
^Croot@ip-172-31-13-169:~# ethtool --offload flannel.1 rx off tx off
Actual changes:
rx-checksumming: off
tx-checksumming: off
	tx-checksum-ip-generic: off
tcp-segmentation-offload: off
	tx-tcp-segmentation: off [requested on]
	tx-tcp-ecn-segmentation: off [requested on]
	tx-tcp-mangleid-segmentation: off [requested on]
	tx-tcp6-segmentation: off [requested on]
root@ip-172-31-13-169:~# dig @10.43.0.10 google.com

; <<>> DiG 9.11.3-1ubuntu1.9-Ubuntu <<>> @10.43.0.10 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17227
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 3d0e2dacabf1e076 (echoed)
;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		30	IN	A	172.217.5.14

;; Query time: 2 msec
;; SERVER: 10.43.0.10#53(10.43.0.10)
;; WHEN: Fri Jul 10 06:26:10 UTC 2020
;; MSG SIZE  rcvd: 77

root@ip-172-31-13-169:~#

@prachidamle (Member)

This is the exact workaround mentioned here: kubernetes/kubernetes#87852 (comment)
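For reference, the workaround as applied above boils down to the following, run as root on each affected node. This is a sketch of the manual steps from the earlier comment, not an official fix; the interface name flannel.1 is the flannel/Canal VXLAN default and the service IP 10.43.0.10 is the RKE cluster-DNS default — adjust both for your environment:

```shell
# Disable RX/TX checksum offload on the flannel VXLAN interface
# (works around the vxlan checksum issue tracked in kubernetes/kubernetes#87852).
ethtool --offload flannel.1 rx off tx off

# Confirm the offload settings actually changed.
ethtool --show-offload flannel.1 | grep -i checksumming

# Re-test UDP DNS resolution against the cluster DNS service IP.
dig rancher.com @10.43.0.10
```

Note that ethtool settings do not persist across reboots, so a persistent deployment of this workaround would need to reapply it at boot.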

@sowmyav27 (Contributor, Author) commented:

Verified with 2.4.5 and KDM pointing to dev-v2.4

  • Deployed clusters using k8s 1.18.6-rancher1-1, 1.17.9-rancher1-1, and 1.16.13-rancher1-1 with all network providers, on 4-node clusters (1 etcd, 1 control plane, and 2 worker nodes) and on 2-node clusters (1 etcd/control plane/worker node and 1 worker node)
  • The CIS scan worked fine on all the clusters
