Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add integrity information to must-gather #1848

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tnozicka
Copy link
Member

Description of your changes:
This PR add integrity information to must-gather dump. This will help to identify whether files have been removed or the collection failed.

Which issue is resolved by this Pull Request:
Resolves #1847

@tnozicka tnozicka added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 19, 2024
@scylla-operator-bot scylla-operator-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 19, 2024
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tnozicka

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@scylla-operator-bot scylla-operator-bot bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 19, 2024
@tnozicka tnozicka force-pushed the must-gather branch 2 times, most recently from e681594 to 88c93a5 Compare March 19, 2024 14:12
@tnozicka tnozicka changed the title [WIP] Add integrity information to must-gather Add integrity information to must-gather Mar 19, 2024
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2024
Copy link
Member

@rzetelskik rzetelskik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only had a brief look - wouldn't a merkle tree be a standard approach for this use case?

@tnozicka
Copy link
Member Author

I only had a brief look - wouldn't a merkle tree be a standard approach for this use case?

It's valid but more for cryptography.

The approach here allows you to easily identify the places with changes with respect to the particular resource and couples well with linux CLI tools. Say, to verify a folder manually, you can run find scylla-operator-must-gather/namespaces/scylla-operator/serviceaccounts -mindepth 1 -exec cat {} \; | sha512sum - and you have the explicit count so you at least know how many files are missing.

@rzetelskik
Copy link
Member

rzetelskik commented Apr 17, 2024

I only had a brief look - wouldn't a merkle tree be a standard approach for this use case?

It's valid but more for cryptography.

The approach here allows you to easily identify the places with changes with respect to the particular resource and couples well with linux CLI tools. Say, to verify a folder manually, you can run find scylla-operator-must-gather/namespaces/scylla-operator/serviceaccounts -mindepth 1 -exec cat {} ; | sha512sum -

would this approach also allow you to check if a given file was originally a part of the archive?

and you have the explicit count so you at least know how many files are missing.

is that actually a valuable information if you don't know which ones are missing?

Since you already put the work into this I don't feel strong about using the merkle tree (altough there probable are many existing implementations), I'm struggling to understand what is the purpose of this exactly if we don't have the capability of signing it (correct me if I'm wrong here)? The PR description says This will help to identify whether files have been removed or the collection failed, so it's working on an assumption that someone purposefully tampered with the archive, so they are already "malicious" (or non-compliant?) - what's to stop them from tampering with the integrity information as well? Would you be able to tell the modified archive apart from an archive in which some manifests just weren't collected?

@tnozicka
Copy link
Member Author

would this approach also allow you to check if a given file was originally a part of the archive?

Transiently. But anything can be messed up with, it's just the amount of work that needs to be put in. Keep in mind this is meant primarily for detecting removed files, not to remove some and add fake files. At a point where the folder doesn't match we can just ask for a new collection.

We could extent this to go at the file level.

is that actually a valuable information if you don't know which ones are missing?

The most pressing concern is temper detection, after that point I'd usually just not proceed further.

Since you already put the work into this I don't feel strong about using the merkle tree (altough there probable are many existing implementations)

Using a merkle tree it is not easy or possible to use standard linux tooling for verification and you have to add extra commands. It's a valid option though.

what's to stop them from tampering with the integrity information as well?

nothing, with any approach we take. they are tempering with the archive with the intent to limit the data to what they think is not related (usually wrongly), not to be intentfully malicious.

@scylla-operator-bot scylla-operator-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 19, 2024
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rzetelskik
Copy link
Member

is that actually a valuable information if you don't know which ones are missing?

The most pressing concern is temper detection, after that point I'd usually just not proceed further.

Are you saying that we wouldn't investigate a customer issue if they removed any file, even if we weren't sure it was necessarily related to Scylla?

what's to stop them from tampering with the integrity information as well?

nothing, with any approach we take. they are tempering with the archive with the intent to limit the data to what they think is not related (usually wrongly), not to be intentfully malicious.

I mean, I get the point, but at this point you'd have to assume they deleted a related file and assume a somewhat hostile approach. I'm not sure how that'd turn out in practice.

Before we go into reviewing the implementation, @zimnx what are your thoughts on this?

@zimnx
Copy link
Collaborator

zimnx commented May 6, 2024

We shouldn't reject the dump if user removed anything.
I totally understand users removing Secrets, as even they are redacted, the dump keeps base64 encoded value which may confuse user to think they aren't because nobody has base64 decoder in their eyes.
Not everyone would want to share their full deployment details, which is also fine as they may be sensitive, so they may remove whole namespaces or even Pods/ConfigMap/other resources from namespace where ScyllaCluster is running.

I think having names of these removed resources would allow us to judge - with high enough confidence I think - whether they were related.
With just number of files within directory we can't tell whether they removed client Pod, or cleanup job Pod or any other our resource that was expected.
This solution is missing them.

Merkle trees would be a good fit, but they come with a cost of having troubles extracting anything from them. We would need to develop tools to check integrity or dump file names.
This solution is human readable with already existing linux tooling being able to validate checksums.
As long as solution is fast and storage efficient, I don't think we need a fancy one.

Copy link
Contributor

@tnozicka: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gke-release-script-latest 68232b5 link true /test e2e-gke-release-script-latest

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 19, 2024
@tnozicka
Copy link
Member Author

/remove-lifecycle stale
/triage accepted

@scylla-operator-bot scylla-operator-bot bot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 22, 2024
Copy link
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 22, 2024
@tnozicka tnozicka removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compute integrity for must-gather dumps
3 participants