Controller fails to pick up ImageCache resource on fresh install if webhook server is down #70

Closed
Chili-Man opened this issue Apr 14, 2021 · 2 comments

@Chili-Man
Contributor

When installing kube-fledged directly from the helm chart, the ImageCache resource, the webhook server and the controller are all deployed at the same time. If the controller comes up before the webhook server, the controller runs into the following error:

I0414 00:43:56.288736       1 controller.go:122] Setting up event handlers
I0414 00:43:56.289021       1 main.go:75] Starting pre-flight checks
I0414 00:43:56.487536       1 controller.go:158] No dangling or stuck jobs found...
I0414 00:43:56.494524       1 controller.go:208] No dangling or stuck imagecaches found...
I0414 00:43:56.494553       1 main.go:79] Pre-flight checks completed
I0414 00:43:56.494582       1 controller.go:223] Starting fledged controller
I0414 00:43:56.494588       1 controller.go:226] Waiting for informer caches to sync
I0414 00:43:56.681989       1 controller.go:231] Starting image cache worker
I0414 00:43:56.682138       1 controller.go:238] Starting cache refresh worker
I0414 00:43:56.682155       1 controller.go:242] Started workers
I0414 00:43:56.682162       1 image_manager.go:340] Starting image manager
I0414 00:43:56.682170       1 image_manager.go:343] Waiting for informer caches to sync
I0414 00:43:56.682179       1 controller.go:429] Starting to sync image cache kernels-default(create)
E0414 00:43:56.700798       1 controller.go:491] Error updating imagecache status to Processing: Internal error occurred: failed calling webhook "validate-image-cache.kubefledged.k8s.io": Post "https://kubefledged-webhook-server.management.svc:3443/validate-image-cache?timeout=1s": no endpoints available for service "kubefledged-webhook-server"
E0414 00:43:56.701170       1 controller.go:366] error syncing imagecache: Internal error occurred: failed calling webhook "validate-image-cache.kubefledged.k8s.io": Post "https://kubefledged-webhook-server.management.svc:3443/validate-image-cache?timeout=1s": no endpoints available for service "kubefledged-webhook-server"
E0414 00:43:56.701206       1 controller.go:377] error syncing imagecache: Internal error occurred: failed calling webhook "validate-image-cache.kubefledged.k8s.io": Post "https://kubefledged-webhook-server.management.svc:3443/validate-image-cache?timeout=1s": no endpoints available for service "kubefledged-webhook-server"
I0414 00:43:57.082511       1 image_manager.go:348] Started image manager

After that, the controller never picks up the ImageCache resource that was deployed. The current workaround is to manually delete the controller pod; the restarted controller then picks up the ImageCache resource.

@senthilrch
Owner

@Chili-Man

Thanks for posting this issue. I will look into this for v0.8.0.

@senthilrch
Owner

@Chili-Man

For deploy-using-yaml, I have fixed this by running "kubectl rollout status deployment kubefledged-webhook-server --watch" before applying the controller manifest.

For the helm chart, I'll look for a different solution (an init container, etc.).

AnchorArray added a commit to noteable-io/kube-fledged that referenced this issue Sep 16, 2021
* Move custom resource definitions to crds directory

* modified travis build conditions

* Initial commit for v0.8.0

* Bumped up versions of go, alpine, operator sdk, docker, cri-tool

* upgrade go in travis ci

* update dependencies

* code changes for upgraded dependencies

* fix unit tests

* make release-amd64 to build all 4 amd64 images

* v1alpha1 -> v1alpha2, kubefledged.k8s.io -> kubefledged.io

* ignore build binaries

* add labels to manifests

* updates to imagecache manifest

* update apigroup apiversion in validatingwebhook

* issue senthilrch#70 deploy-using-yaml: wait for webhook-server running

* issue senthilrch#66: upgrade crd api version to v1

* updates to helm chart & operator

* update clusterrole to list and watch

* update signer name in csr

* remove v1alpha1 cr

* issue senthilrch#70 add init container to wait for webhook-server

* helm chart fix webhook service name

* helm chart update

* fix chart apiVersion

* updated readme for helm chart installation

* updated name of helm operator cr

* fix issue senthilrch#75: workaround

* Ensure validating webhook configuration client config service name for the webhook server matches the correct webhook service name

* Ensure validating webhook configuration client config service name for the webhook server matches the correct webhook service name

* add init option to webhook server

* update manifests

* updated helm chart

* updated makefile and manifests

* updated helm operator

* makefile wait for operator ready

* get imagecache before updating status

* pre-install hook for validatingwebhookconfiguration

* fix golint errors

* modify refresh/purge annotation key to kubefledged.io/xxx

* cri-client-image name as env instead of cmd flag

* read busybox image from env

* use busybox image from gcr.io to overcome dockerhub ratelimiting

* fix issue senthilrch#89 change hostpath filetype to socket

* delete pre-install hook in "make remove-operator-and-kubefledged"

* add annotations to validatingwebhookconfiguration

* add "helm repo update" to readme

* continue processing job deletion when not found

* check if refresh-cache annotation exists

* add "helm repo update" to readme

* update helm chart to use release namespace

* update design proposal document

* expose helm parameters in operator CR

* document helm parameters

* update release version to v0.8.2

* set status to unknown when unable to fetch pod

* add check for image pull/delete status unknown

* update log messages

* deploy controller and operator to same namespace

* fix unit test errors

* restore Kubefledged CR during "remove-operator-and-kubefledged"

* Update README.md

* Update README.md

* Update design-proposal.md

* Update README.md

Co-authored-by: Diego Rodriguez <diego@noteable.io>
Co-authored-by: Senthil Raja Chermapandian <senthilrch@gmail.com>