
[BUG] When running on cloud providers ingress and service-ca will not stay online #270

Closed
cooktheryan opened this issue Sep 14, 2021 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@cooktheryan
Contributor

cooktheryan commented Sep 14, 2021

What happened:

Repeated restarts of openshift dns, ingress, and service-ca pods.

NOTE: Observed before the SELinux modifications were merged

What you expected to happen:

Pods do not restart

How to reproduce it (as minimally and precisely as possible):

  1. Make microshift
  2. Start the microshift service
  3. kubectl get po -wA

Environment:

  • Microshift version (use microshift version):
[root@microshift ~]# microshift version
Microshift Version: 4.7.0-0.microshift-2021-08-31-224727-17-gb1fb489
Base OKD Version: 4.7.0-0.okd-2021-06-13-090745
  • Hardware configuration:
    RAM
    16GB
    VCPUs
    32 VCPU

  • OS (e.g: cat /etc/os-release): Red Hat Enterprise Linux release 8.4 (Ootpa)

  • Kernel (e.g. uname -a): 4.18.0-305.el8.x86_64

  • Others:

Relevant Logs

Router

I0914 21:06:11.814726       1 template.go:433] router "msg"="starting router"  "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 9cc0c8fc\nversionFromGit: v0.0.0-unknown\ngitTreeState: dirty\nbuildDate: 2021-06-11T16:32:09Z\n"
I0914 21:06:11.817520       1 metrics.go:154] metrics "msg"="router health and metrics port listening on HTTP and HTTPS"  "address"="0.0.0.0:1936"
I0914 21:06:11.822374       1 router.go:191] template "msg"="creating a new template router"  "writeDir"="/var/lib/haproxy"
I0914 21:06:11.822468       1 router.go:270] template "msg"="router will coalesce reloads within an interval of each other"  "interval"="5s"
I0914 21:06:11.823069       1 router.go:332] template "msg"="watching for changes"  "path"="/etc/pki/tls/private"
I0914 21:06:11.823134       1 router.go:262] router "msg"="router is including routes in all namespaces"  
E0914 21:06:11.929475       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0914 21:06:11.965842       1 router.go:579] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0914 21

service ca

I0914 21:10:48.891644       1 reflector.go:530] k8s.io/kube-aggregator/pkg/client/informers/externalversions/factory.go:117: Watch close - *v1.APIService total 0 items received
I0914 21:10:48.891907       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.MutatingWebhookConfiguration total 0 items received
I0914 21:10:48.891938       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
E0914 21:10:48.891966       1 leaderelection.go:325] error retrieving resource lock openshift-service-ca/service-ca-controller-lock: Get "https://10.43.0.1:443/api/v1/namespaces/openshift-service-ca/configmaps/service-ca-controller-lock?timeout=35s": read tcp 10.42.0.3:34648->10.43.0.1:443: read: connection timed out
I0914 21:10:48.891462       1 reflector.go:530] k8s.io/apiextensions-apiserver/pkg/client/informers/externalversions/factory.go:117: Watch close - *v1.CustomResourceDefinition total 0 items received
I0914 21:10:58.859985       1 leaderelection.go:278] failed to renew lease openshift-service-ca/service-ca-controller-lock: timed out waiting for the condition
I0914 21:10:58.860110       1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' service-ca-64547678c6-h8smv_c0bf546f-4fa0-4449-b99c-9e9dd548d1fe stopped leading
I0914 21:10:58.861208       1 request.go:844] Error in request: resource name may not be empty
E0914 21:10:58.861261       1 leaderelection.go:301] Failed to release lock: resource name may not be empty
W0914 21:10:58.861309       1 leaderelection.go:75] leader election lost
@cooktheryan cooktheryan added the kind/bug Categorizes issue or PR as related to a bug. label Sep 14, 2021
@rootfs
Member

rootfs commented Sep 14, 2021

10.42.0.3:34648->10.43.0.1:443: read: connection timed out

This looks like a security group config issue. Maybe open up port 443 and try again? Also, the service CIDR 10.43.0.0/16 may collide with the OpenStack VPC CIDR.
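The suspected CIDR collision can be checked programmatically. A minimal sketch using Python's ipaddress module; the 172.31.0.0/16 VPC range below is only an example, substitute the range your cloud provider actually assigned:

```python
import ipaddress

# Service CIDR from the logs above (10.43.0.1 is the apiserver service IP)
service_cidr = ipaddress.ip_network("10.43.0.0/16")

# Example cloud VPC CIDR; replace with your provider's actual range
vpc_cidr = ipaddress.ip_network("172.31.0.0/16")

if service_cidr.overlaps(vpc_cidr):
    print(f"collision: {service_cidr} overlaps {vpc_cidr}")
else:
    print(f"no overlap between {service_cidr} and {vpc_cidr}")
```

If the two ranges overlap, traffic destined for service IPs can be routed into the VPC instead of the cluster, which would produce exactly the kind of timeouts shown in the service-ca log.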

@cooktheryan
Contributor Author

For my issue there is a service that needs to be restarted, but I haven't been able to identify which one yet. Once the pods are created and I reboot, everything operates as expected after the reboot, so it has to be some service.

@ianzhang366
Contributor

Hi @rootfs @cooktheryan

I tried the following settings:

  1. Enabled port 443 on the EC2 instance
  2. Disabled the firewall
  3. Started microshift with the following config:
cluster:
  url: https://127.0.0.1:6443
  clusterCIDR: 172.17.1.0/24
  serviceCIDR: 172.17.2.0/24
  dns: 172.17.2.10

However, I'm still getting the same error:

$ kubectl logs -n kubevirt-hostpath-provisioner   kubevirt-hostpath-provisioner-4lhx8
F0915 21:13:59.846380       1 hostpath-provisioner.go:270] Error getting server version: Get https://172.17.2.1:443/version?timeout=32s: dial tcp 172.17.2.1:443: i/o timeout

Also, I tried to change the hostname on my EC2 instance but got the same error.

Any idea how to make progress on this one?

@rootfs
Member

rootfs commented Sep 15, 2021

@cooktheryan @ianzhang366 I'll spin up an EC2 instance to investigate. In the meantime, can you check whether the all-in-one microshift container works for you on EC2 and OpenStack?

sudo setsebool -P container_manage_cgroup true
sudo podman volume create ushift-vol
sudo podman run -d --rm --name ushift --privileged -v /lib/modules:/lib/modules -v ushift-vol:/var/lib --hostname ushift -p 6443:6443 quay.io/microshift/microshift:4.7.0-0.microshift-2021-08-31-224727-aio-linux-amd64

@ianzhang366
Contributor

Tried on an EC2 instance:

[ec2-user@ip-172-31-33-139 ~]$ sudo setsebool -P container_manage_cgroup true
[ec2-user@ip-172-31-33-139 ~]$ sudo podman volume create ushift-vol
ushift-vol
[ec2-user@ip-172-31-33-139 ~]$ sudo podman run -d --rm --name ushift --privileged -v /lib/modules:/lib/modules -v ushift-vol:/var/lib --hostname ushift -p 6443:6443 quay.io/microshift/microshift:4.7.0-0.microshift-2021-08-31-224727-aio-linux-amd64
Trying to pull quay.io/microshift/microshift:4.7.0-0.microshift-2021-08-31-224727-aio-linux-amd64...
Getting image source signatures
Copying blob dfd8c625d022 done
Copying blob d90f0305a1ba done
Copying blob 6990ea5645c7 done
Copying blob 893dc2344ef6 done
Copying blob 624513fea370 done
Copying blob b63aa9a6b5a1 done
Copying blob c11223d418c7 done
Copying blob 441ee9c555e7 done
Copying blob 00ad89df3ad9 done
Copying config b9465f0252 done
Writing manifest to image destination
Storing signatures
703032913c86ef9ce78c5976218706d8e14c522bfa9ee252544655cf6af207cf
[ec2-user@ip-172-31-33-139 ~]$ podman ps
CONTAINER ID  IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES
[ec2-user@ip-172-31-33-139 ~]$ date
Thu Sep 16 01:29:00 UTC 2021
[ec2-user@ip-172-31-33-139 ~]$ podman ps -a
CONTAINER ID  IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES
[ec2-user@ip-172-31-33-139 ~]$ date
Thu Sep 16 01:33:44 UTC 2021
[ec2-user@ip-172-31-33-139 ~]$ podman ps -a
CONTAINER ID  IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES
[ec2-user@ip-172-31-33-139 ~]$ podman exec -it ushift bash
Error: no container with name or ID "ushift" found: no such container

It seems RHEL can't bring up the container...

@rootfs
Member

rootfs commented Sep 16, 2021

This looks like a known issue on RHEL 8.4 that is resolved in 8.5.

A temporary workaround is to disable the nm-cloud-setup service:

systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
reboot
