Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit K8S events to the postgresql CR as feedback to the requestor / user #896

Merged
merged 6 commits into from Apr 27, 2020

Conversation

frittentheke
Copy link
Contributor

Currently the only "feedback" the user or creator of a postgresql custom resource receives is the status. While the operator does produce quite a bit of log, this is not suitable to point users to in order for them to find issues with their particular cluster.

This PR shall enable to operator to (also) send events which are promoted by the Kubernetes documentation (https://kubernetes.io/docs/tasks/debug-application-cluster/#example-debugging-pending-pods) to help debugging issues with a resource and to search for messages from controllers handling this resource.

As this shall be a first cut, I might not have added enough events or even have added events that are too noisy. My intention was to report back certain aspects of the life-cycle:

  • Creation with the individual steps / resources being created
  • (Warnings in case resource limits are being adjusted)
  • Sync in case anything goes wrong
  • Deletion (might take a while or even run into problems - there seems to be not a lot of error handling / reporting there yet)
  • Switchover or Rolling-Restarts initiated by the Operator

@frittentheke frittentheke changed the title Start sending K8S events as feedback to the postgresql CR Emit K8S events to the postgresql CR as feedback to the requestor / user Apr 2, 2020
@FxKu
Copy link
Member

FxKu commented Apr 15, 2020

Thanks for this interesting PR. There was already an internal request to extend the status, but events would also improve the user experience, I guess.

So far it doesn't work properly for me. Getting this error:
Could not construct reference to: '&v1.Postgresql{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""},

Could you also update the clusterrole here.

@frittentheke
Copy link
Contributor Author

Thanks for this interesting PR. There was already an internal request to extend the status, but events would also improve the user experience, I guess.

So far it doesn't work properly for me. Getting this error:
Could not construct reference to: '&v1.Postgresql{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""},

@FxKu In general this works for me, but with the Postgresql reference passed along the various functions quite a lot there might be a call to the event function that does not get the input it expects...

Since events are logged (by the operator) along with the other logs, can you narrow this down to a certain action ?

Could you also update the clusterrole here.

Done. I melted this into last commit.

@FxKu
Copy link
Member

FxKu commented Apr 17, 2020

it's for all the events 😃

at the end of each error output I see also:
no kind is registered for the type v1.Postgresql in scheme "k8s.io/client-go/kubernetes/scheme/register.go:69"'

Btw: Unit tests in pkg/cluster/cluster_test.go and pkg/cluster/k8sres_test.go are failinng because the event recorder is missing in the constructor

@frittentheke
Copy link
Contributor Author

it's for all the events smiley

at the end of each error output I see also:
no kind is registered for the type v1.Postgresql in scheme "k8s.io/client-go/kubernetes/scheme/register.go:69"'

Btw: Unit tests in pkg/cluster/cluster_test.go and pkg/cluster/k8sres_test.go are failinng because the event recorder is missing in the constructor

@FxKu Urgh sorry about that, I might have missed /messed-up something when I cherry-picked some commits from my local dev branch to this PR - I'll look into it.

@frittentheke
Copy link
Contributor Author

@FxKu

  1. I did push a new commit to may branch that should fix the unit tests, sorry about missing on that initially. I also moved to the current HEAD of master.
  2. I then just built the operator locally and ran it via minikube (using the provided ./run_operator_locally.sh script) and I received events just fine:
  Normal  Update      2m24s  postgres-operator  Performing rolling update
  Normal  Switchover  118s   postgres-operator  Switching over from "acid-minimal-cluster-0" to "default/acid-minimal-cluster-1"
  Normal  Switchover  117s   postgres-operator  Successfully switched over from "acid-minimal-cluster-0" to "default/acid-minimal-cluster-1"
  Normal  Switchover  117s   postgres-operator  Switchover from "acid-minimal-cluster-0" to "default/acid-minimal-cluster-1" FAILED: <nil>
  Normal  Update      99s    postgres-operator  Rolling update done - pods have been recreated

As I said, I expect that there might be more cases that events should be generated for (to properly allow users to follow the life-cycle of their clusters and there might still be bugs). But you not seeing any events at all seems strange.

@FxKu
Copy link
Member

FxKu commented Apr 20, 2020

Thanks for the update. It's interesting: When updating my operator deployment I can see events for running clusters. But when I create a new cluster I get these Could not construct reference to ... errors for everything - add, update and delete events - until the first sync I guess. Seems that new clusters are not registered correctly. Hmm...

Operator logs when it's working (maybe that helps):
level=debug msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"default\", Name:\"acid-minimal-cluster\", UID:\"2dce694f-2f83-45a4-bfc2-8409280917b2\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"1754\", FieldPath:\"\"}): type: 'Normal' reason: 'Update' Performing rolling update"

@FxKu FxKu added this to the 1.5 milestone Apr 20, 2020
@frittentheke
Copy link
Contributor Author

@FxKu yeah I observe the same errors now. Very strange that it complains about

no kind is registered for the type v1.Postgresql in scheme "k8s.io/client-go/kubernetes/scheme/register.go:69

for new clusters.

@frittentheke
Copy link
Contributor Author

@FxKu apparently one has to get a proper reference to use when sending events.
Why and how this worked for some of those events - I don't know.

I added this with commit 633e001 now ... and heureka, we have many many more events:

[...]
      superuser
      createdb
  Volume:
    Size:  1Gi
Status:
  Postgres Cluster Status:  Running
Events:
  Type    Reason       Age    From               Message
  ----    ------       ----   ----               -------
  Normal  Create       4m16s  postgres-operator  Started creation of new cluster resources
  Normal  Endpoints    4m16s  postgres-operator  Endpoint "default/acid-minimal-cluster" has been successfully created
  Normal  Services     4m16s  postgres-operator  The service "default/acid-minimal-cluster" for role master has been successfully created
  Normal  Services     4m16s  postgres-operator  The service "default/acid-minimal-cluster-repl" for role replica has been successfully created
  Normal  Secrets      4m14s  postgres-operator  The secrets have been successfully created
  Normal  StatefulSet  4m14s  postgres-operator  Statefulset "default/acid-minimal-cluster" has been successfully created
  Normal  StatefulSet  3m26s  postgres-operator  Pods are ready

@FxKu
Copy link
Member

FxKu commented Apr 24, 2020

Tested it and it works now. Very nice new feature. Thanks a lot @frittentheke

@FxKu
Copy link
Member

FxKu commented Apr 24, 2020

👍

1 similar comment
@sdudoladov
Copy link
Member

👍

@FxKu FxKu merged commit 21b9b6f into zalando:master Apr 27, 2020
@@ -1136,6 +1163,7 @@ func (c *Cluster) Switchover(curMaster *v1.Pod, candidate spec.NamespacedName) e
// close the label waiting channel no sooner than the waiting goroutine terminates.
close(podLabelErr)

c.eventRecorder.Eventf(c.GetReference(), v1.EventTypeNormal, "Switchover", "Switchover from %q to %q FAILED: %v", curMaster.Name, candidate, err)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this should be wrapped in if err != nil {}
FAILED: <nil> is technically correct, but unexpected when the switchover was successful

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened issue #1134

@frittentheke frittentheke deleted the events branch October 1, 2020 23:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants