Add experimental gather and must-gather commands #1368

Merged
merged 5 commits into from Sep 27, 2023

tnozicka
Member

@tnozicka tnozicka commented Sep 1, 2023

Screenshot from 2023-09-15 14-07-35
Description of your changes:
This PR introduces two new commands for the scylla-operator binary that help collect debug dumps.

must-gather collects a predefined set of resources in the Kubernetes cluster that relate to scylla-operator APIs or the cluster state itself. It also has a flag (--all-resources) that collects every namespaced and non-namespaced resource that is "listable", which is useful when the default dump isn't enough to identify the issue, or for our CI.
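
A minimal sketch of what "listable" could mean here, assuming it refers to resources whose discovery entry advertises the `list` verb. The `apiResource` type and the `filterListable` helper are hypothetical stand-ins for illustration, not the code from this PR:

```go
package main

import "fmt"

// apiResource is a hypothetical, simplified stand-in for a discovered
// Kubernetes API resource (name, scope, and supported verbs).
type apiResource struct {
	Name       string
	Namespaced bool
	Verbs      []string
}

// isListable reports whether the resource advertises the "list" verb.
func isListable(r apiResource) bool {
	for _, v := range r.Verbs {
		if v == "list" {
			return true
		}
	}
	return false
}

// filterListable keeps only the resources that can be listed, preserving order.
func filterListable(resources []apiResource) []apiResource {
	var out []apiResource
	for _, r := range resources {
		if isListable(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	resources := []apiResource{
		{Name: "pods", Namespaced: true, Verbs: []string{"get", "list", "watch"}},
		{Name: "nodes", Namespaced: false, Verbs: []string{"get", "list"}},
		{Name: "tokenreviews", Namespaced: false, Verbs: []string{"create"}},
	}
	// Both namespaced and non-namespaced resources are kept, as long as they are listable.
	for _, r := range filterListable(resources) {
		fmt.Printf("%s namespaced=%t\n", r.Name, r.Namespaced)
	}
}
```

Resources that only support verbs such as `create` cannot be enumerated, so a collector working this way would skip them.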

gather is aimed at collecting individual resources (plus related objects by default) and helps to get extra data when we need to ask for an additional resource to be collected.

All commands redact Secret data except for well-known Kubernetes keys ("ca.crt", "tls.crt", "service-ca.crt") that hold public parts of certificates.
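
As a rough illustration of the redaction rule above (a sketch, not the PR's actual implementation), the following keeps the three well-known public keys and replaces every other value. The names `redactSecretData` and `publicSecretKeys` and the `<redacted>` placeholder are assumptions made for this example; only the three key names come from the description:

```go
package main

import (
	"fmt"
	"sort"
)

// publicSecretKeys lists the well-known keys named in the PR description
// that hold only public parts of certificates.
var publicSecretKeys = map[string]bool{
	"ca.crt":         true,
	"tls.crt":        true,
	"service-ca.crt": true,
}

// redactedPlaceholder is a hypothetical marker; the real commands may use a different one.
const redactedPlaceholder = "<redacted>"

// redactSecretData returns a copy of data in which every value is replaced
// by the placeholder, except for the well-known public keys.
func redactSecretData(data map[string][]byte) map[string][]byte {
	out := make(map[string][]byte, len(data))
	for k, v := range data {
		if publicSecretKeys[k] {
			// Copy the value so the result does not alias the input.
			out[k] = append([]byte(nil), v...)
			continue
		}
		out[k] = []byte(redactedPlaceholder)
	}
	return out
}

func main() {
	redacted := redactSecretData(map[string][]byte{
		"tls.crt":  []byte("PUBLIC CERTIFICATE"),
		"tls.key":  []byte("PRIVATE KEY"),
		"password": []byte("hunter2"),
	})

	// Print in a stable order, since Go map iteration order is random.
	keys := make([]string, 0, len(redacted))
	for k := range redacted {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %s\n", k, redacted[k])
	}
}
```

Using an allowlist of public keys rather than a denylist of private ones means an unknown key is redacted by default.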

Which issue is resolved by this Pull Request:
Resolves #1365 #1394

Also shaves off about 35 minutes from our e2e suites!

/cc

@scylla-operator-bot
Contributor

@tnozicka: GitHub didn't allow me to request PR reviews from the following users: tnozicka.

Note that only scylladb members and repo collaborators can review this PR, and authors cannot review their own PRs.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 1, 2023
@tnozicka tnozicka marked this pull request as draft September 1, 2023 14:35
@scylla-operator-bot scylla-operator-bot bot added labels Sep 1, 2023: area/dependency (Issues or PRs related to dependency changes), approved (Indicates a PR has been approved by an approver from all required OWNERS files), size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files)
@tnozicka tnozicka force-pushed the must-gather branch 2 times, most recently from 4d9390b to cd6a04c Compare September 1, 2023 15:38
@scylladb scylladb deleted a comment from scylla-operator-bot bot Sep 1, 2023
@tnozicka
Member Author

tnozicka commented Sep 6, 2023

/test images

@tnozicka tnozicka marked this pull request as ready for review September 6, 2023 15:28
@tnozicka tnozicka force-pushed the must-gather branch 6 times, most recently from 07f5f3b to 18e1ecc Compare September 7, 2023 13:45
@tnozicka tnozicka force-pushed the must-gather branch 3 times, most recently from 39f8974 to dc4de15 Compare September 15, 2023 08:21
@tnozicka tnozicka changed the title [WIP] Add experimental gather and must-gather commands Add experimental gather and must-gather commands Sep 15, 2023
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 15, 2023
@tnozicka
Member Author

/retest
(infra)

pkg/helpers/array.go (outdated, resolved)
Member

@rzetelskik rzetelskik left a comment

/approve
lgtm, waiting for other reviewers

pkg/cmd/operator/gatherbase.go (outdated, resolved)
pkg/cmd/operator/gatherbase.go (resolved)
pkg/cmd/operator/gatherbase.go (outdated, resolved)
pkg/cmd/operator/gather.go (resolved)
pkg/cmd/operator/gather.go (resolved)
pkg/gather/collect/collect.go (outdated, resolved)
pkg/gather/collect/collect.go (outdated, resolved)
pkg/gather/collect/collect.go (resolved)
pkg/gather/collect/collect.go (outdated, resolved)
test/e2e/framework/dump.go (outdated, resolved)
@tnozicka
Member Author

/hold
(for final manual check of gathered resources)

@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 26, 2023
@tnozicka tnozicka requested a review from zimnx September 27, 2023 08:27
Collaborator

@zimnx zimnx left a comment

/lgtm

@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2023
@scylla-operator-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rzetelskik, tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tnozicka
Member Author

@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 27, 2023
@scylla-operator-bot scylla-operator-bot bot merged commit c2636d0 into scylladb:master Sep 27, 2023
11 checks passed
@tnozicka tnozicka deleted the must-gather branch September 27, 2023 16:35
@mykaul
Contributor

mykaul commented Sep 28, 2023

Is QA using it now in their tests? @rayakurl ?

@rayakurl

Is QA using it now in their tests? @rayakurl ?

We are not using this in cloud (we use the cloud CLI, so perhaps it will be wrapped in the CLI and we will use it if needed). I think this question should be addressed to @fruch and @vponomaryov, who are testing the operator directly.

@fruch

fruch commented Sep 28, 2023

Is QA using it now in their tests? @rayakurl ?

We are not using this in cloud (we use the cloud CLI, so perhaps it will be wrapped in the CLI and we will use it if needed). I think this question should be addressed to @fruch and @vponomaryov, who are testing the operator directly.

@mykaul we aren't using those commands; I don't think we ever ran into a situation where we needed them.

We have a much more severe issue: we don't have instructions on how to catch a Scylla coredump on an operator setup...

@vponomaryov
Contributor

Is QA using it now in their tests? @rayakurl ?

We are not using this in cloud (we use the cloud CLI, so perhaps it will be wrapped in the CLI and we will use it if needed). I think this question should be addressed to @fruch and @vponomaryov, who are testing the operator directly.

@mykaul we aren't using those commands; I don't think we ever ran into a situation where we needed them.

We have a much more severe issue: we don't have instructions on how to catch a Scylla coredump on an operator setup...

I totally agree with @fruch
Being able to get any coredump file is also needed to solve users' problems.

@fruch

fruch commented Sep 28, 2023

@mykaul we aren't using those commands; I don't think we ever ran into a situation where we needed them.

I take some of it back: we are not using those commands since they are new (I thought this was some fix),
and we have our own code that collects all the information we need (maybe this is even modeled on that test code?)

we wrote a task down, I'm not sure when we'll get to add it to the tests.

@tnozicka
Member Author

tnozicka commented Oct 2, 2023

and we have our own code that collects all the information we need (maybe this is even modeled on that test code?)

yep, from what I recall QA collects artifacts based on our old bash script so that part should definitely migrate over, but they may wait a few days to get the docs (#1367) for it to have an easier transition

core dumps are kind of an off topic, but I've made #1436 to track it

@fruch

fruch commented Oct 2, 2023

and we have our own code that collects all the information we need (maybe this is even modeled on that test code?)

yep, from what I recall QA collects artifacts based on our old bash script so that part should definitely migrate over, but they may wait a few days to get the docs (#1367) for it to have an easier transition

it's not a bash script but Python; either way, it collects all the k8s resource information it can get.

core dumps are kind of an off topic, but I've made #1436 to track it

great, we should be able to explain to users how to set it up so that coredumps are collected

@zimnx zimnx added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 4, 2023
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/dependency: Issues or PRs related to dependency changes.
kind/feature: Categorizes issue or PR as related to a new feature.
lgtm: Indicates that a PR is ready to be merged.
size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Add gather and must-gather commands
7 participants