Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP-0089: Non-falsifiable provenance support #529

Merged
merged 1 commit into from
Jan 18, 2022

Conversation

priyawadhwa
Copy link
Contributor

@priyawadhwa priyawadhwa commented Oct 4, 2021

No description provided.

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 4, 2021
@bobcatfish
Copy link
Contributor

/assign

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a bunch of questions that, I think, mostly stem from my lack of knowledge about how SPIFFE/SPIRE works.

teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
@vdemeester vdemeester added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Oct 11, 2021
@vdemeester
Copy link
Member

/assign @pritidesai

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 17, 2021
Copy link
Contributor

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding more detail! I've left some initial comments @priyawadhwa - would really like to dig into some of the requirements in more detail and get more clarity on exactly what assurances this will give us.

teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 18, 2021
Copy link
Contributor

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding all the additional detail @priyawadhwa, I think the overall picture is becoming a lot clearer for me!

A couple thoughts compared to the initial POC and proposal:

  • Sounds like we're shifting from securing specifically the results in the TaskRun to securing the entire TaskRun spec and status, is that right?
  • It sounds like we no longer need the individual pods to be connecting to spire - or maybe that's for solving a different problem? (in this proposal someone still modify the termination message of the pod? maybe its actually not possible to modify the termination message unless you're the kubelet...)

teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved

## Proposal
As mentioned above, the basic design looks like this:
1. Tekton Pipelines receives a TaskRun config, and generates the Pod for it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as an extra detail, related to my question above about modifications during execution, afaik the pipelines controller will be updating the TaskRun status as the TaskRun executes - I'm wondering if we want to be signing the TaskRun with each modification (and maybe even verifying that nothing has changed since the last modification??)

im not sure if im going too far but it seems like if we dont do that, something else could modify the taskrun status during execution, and the pipelines controller would just incrementally add to those changes, only signing at the end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense, and would probably be good to have. i'll add in some thoughts into the proposal!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of signing the TaskRun with each modification, do you think it could make sense to store the current TaskRun in memory and compare it to the actual thing? maybe something like this:

  1. store the initial TaskRun in memory
  2. when we want to modify the running TaskRun, compare the running TaskRun to the one in memory.
  3. if it doesn't match, then we never request an SVID and skip any SPIRE signing altogether
  4. if it does match, then we modify the TaskRun and store the new modified version in memory, and repeat steps 2-4 until taskrun completes

teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
@priyawadhwa
Copy link
Contributor Author

Sounds like we're shifting from securing specifically the results in the TaskRun to securing the entire TaskRun spec and status, is that right?

Yah, I think the entire spec needs to be secured to guarantee non-falsifiable provenance (since we take the results, steps, parameters and other info from the TaskRun which all ends up in provenance).

It sounds like we no longer need the individual pods to be connecting to spire - or maybe that's for solving a different problem? (in this proposal someone still modify the termination message of the pod? maybe its actually not possible to modify the termination message unless you're the kubelet...)

That's a good point! If the termination message can be modified then we'll also need pods connected to spire so that results can be secured... from the docs it isn't super clear if anyone other than the kubelet has access to the termination message. If we're not sure I think we can just play it safe and also secure results.

@xchapter7x xchapter7x added this to Needs a look in TEP Needs a Look List Nov 30, 2021
@priyawadhwa
Copy link
Contributor Author

Made some updates! I put the termination message issue as a Risk for now, I think it should be easy enough to build signing Results on top of the work in this proposal if it's needed.

I'll try and find out if the termination message can be changed by anyone other than the kubelet in the meantime!

teps/0089-spire-support.md Outdated Show resolved Hide resolved
@imjasonh
Copy link
Member

imjasonh commented Dec 1, 2021

I'll try and find out if the termination message can be changed by anyone other than the kubelet in the meantime!

Looks like they can. 😢

https://gist.github.com/imjasonh/9011395d1f88af02b5bdc5901b739090

This might be another good signal that we should stop relying on termination messages in general.

@priyawadhwa
Copy link
Contributor Author

Thanks @imjasonh! I'll update the TEP.

Copy link
Contributor

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits from me, and a couple of bigger things I'm wondering, about, specifically I'm wondering about including the requirement to verify the TaskRun itself.

To fulfill non-falsifiablity do we need to verify the TaskRun, or just the results in the TaskRun - and/or if we DO need to verify the TaskRun, do we need to verify pods and Pipelines instead?

If we do need to verify the TaskRun (and the PipelineRun), I'm wondering about the performance implications - keeping all executing TaskRuns and PipelineRuns in memory has a memory implication but also I think that wouldn't work b/c the controller can be restarted during execution. So I think we'd need to sign and verify the entire TaskRun (maybe PipelineRun) content on every reconcile that updates them which I'm thinking could have a significant performance impact?

Lastly, verifying TaskRuns/PipelineRuns (and/or even pods) will introduce limitations around interoperablity with other mutating admission controllers that users might want to be using (e.g. isito sidecar injection, or the SolarWinds approach to inject tasks into a Pipeline)

teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
@priyawadhwa priyawadhwa force-pushed the spire-support branch 4 times, most recently from 3b01fcd to 557f32a Compare December 15, 2021 01:34
@dibyom
Copy link
Member

dibyom commented Dec 20, 2021

/assign @jerop

@xchapter7x xchapter7x removed this from Needs Approval in TEP Needs a Look List Dec 21, 2021
@dlorenc
Copy link
Contributor

dlorenc commented Dec 23, 2021

This looks great to me now, the wording on exactly what this guarantees is much clearer than I originally came up with :)

@sudo-bmitch
Copy link

One detail to consider is that Spire certificates are going to be short lived (intentionally) so if used for signing persistent data, they may be expired by the time you try to verify them. That shouldn't be a blocker, just something to realize you may need to code around (either by ignoring the expiration, having a grace period, or using a TSA).

Overall, I like where this is going, TEP looks good to me.

Copy link
Contributor

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly nits from me, I think we're generally on the same page tho so nothing blocking on my side!! Thanks for all the back and forth on this

Final couple thoughts from me but not blockers:

  • What is involved / required to install SPIRE itself in a cluster? i'm wondering if there is anything a user would need to be aware of in order to use this feature as far as the SPIRE requirements (i dunno what this would be, maybe something like 'requires an image registry to push to' or something)
  • It might be more clear to rename this TEP to something like "Non-falsifiable provenance support" vs "SPIRE support" (so it's more clear to a reader who isn't familiar with SPIRE how this TEP might be relevant to them)

image

/approve

teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
1. Enabling the alpha feature for SPIRE in Tekton
1. Requesting an SVID & signature over Results for a TaskRun
1. Verification of SPIRE with Chains
1. Verify that a TaskRun that isn't created by Tekton isn't signed Chains
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you be planning to set these up as something that runs regularly or something that's done manually to verify the behavior? I think these would be valuable as end to end tests we keep going forward, not 100% sure if we'd prefer to always have spire enabled in our end to end tests (and deployed and configured in the cluster under test) or only do it for a sub set of tests

we'd need to setup a test cluster that has both chains + pipelines in it as well (maybe chains already has this?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally it would be something that runs regularly but i'll probably know more once i start actually writing the code 😅

i think we will need to set up a new test cluster, chains doesn't have one as of now!

teps/0089-spire-support.md Outdated Show resolved Hide resolved
teps/0089-spire-support.md Outdated Show resolved Hide resolved
@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 12, 2022
@priyawadhwa priyawadhwa changed the title TEP-0089: SPIRE support TEP-0089: Non-falsifiable provenance support Jan 13, 2022

### 2. Tekton Pipelines can't verify that the results it reads weren't modified
The solution to this is Signed Results.
We will modifiy the entrypointer image to sign results with SPIRE once they're available.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know how fine grained authorization is for the signing keys? e.g. does it allow per-container authorization or is it only for the Pod? I'm wondering if/how we might be able to reduce access to keys from user steps 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few different selectors you can configure, including the container image and container name, so I believe we should be able to reduce access from user steps.

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall sounds good, just few questions:

  • This could be an opt-in or opt-out feature right ?
  • If that's the case, I would like to explore the possible problem that could arise, like CSI driver not being present, etc..
  • On the "release" problem if this is enabled or not (because of the need to have a volume mounted in the controller), the operator could help

teps/0089-nonfalsifiable-provenance-support.md Outdated Show resolved Hide resolved
@priyawadhwa
Copy link
Contributor Author

Thanks @bobcatfish, and @vdemeester! To answer some of your questions:

What is involved / required to install SPIRE itself in a cluster?

I might need to look into this a little more, but I definitely think we'll need to add this to any documentation around this feature. In addition to the k8s yaml described in the docs, it looks like users will need to install the k8s node attestor which requires some setup within the cluster.

It might be more clear to rename this TEP to something like "Non-falsifiable provenance support"

SGTM

This could be an opt-in or opt-out feature right ?

Yup, it'll be opt-in as an experimental feature.

If that's the case, I would like to explore the possible problem that could arise, like CSI driver not being present, etc..

Right now, I'm just planning to clearly document the required tools users will need to install before using the feature. We can definitely consider better ways of handling potential errors in the future.

On the "release" problem if this is enabled or not (because of the need to have a volume mounted in the controller), the operator could help

That's a good idea! I'll add that into the TEP as a potential option for releasing (@bobcatfish had also suggested doing two separate releases, one with the volume and one without the volume).

@bobcatfish
Copy link
Contributor

sgtm! @jerop and @pritidesai i think you are currently the outstanding approvers on this one

@vdemeester
Copy link
Member

Thanks @priyawadhwa , LGTM. We have time to figure things out 😇

Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @priyawadhwa!

"Marking a TaskRun as failed" has a specific meaning right now - that we mark ConditionSucceeded as false - so it'd be great to clarify in the proposal that it means adding an annotation to the TaskRun indicating that the verification of the TaskRun failed (understood this is what was meant from reading other discussions here, that it's not really failing the TaskRun)

Also had a few questions below

(Please squash the commits)

teps/0089-nonfalsifiable-provenance-support.md Outdated Show resolved Hide resolved
teps/0089-nonfalsifiable-provenance-support.md Outdated Show resolved Hide resolved
As mentioned above, the basic design looks like this (this is meant to be high level and still needs to be fleshed out a bit):
1. Tekton Pipelines receives a TaskRun config, and generates the Pod for it with SPIRE mounted in
1. The Pod executes, and the entrypointer requests an SVID and signature over the Results
1. Tekton Pipelines verifies the Results
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we referring to task results here? Task results are part of the taskRun status. So the entrypoint is requesting SVID and signature for that particular result? lets say if my task has 10 different results, will I see those many SVIDs and signatures? Are these SVIDs and signatures also stored as annotations in addition to the SVIDs and signature of the taskRun itself?

Copy link
Member

@pritidesai pritidesai Jan 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SVID for a taskRun could be spiffe://tekton.dev/<taskrun> and for task result it could be spiffe://tekton.dev/<taskrun>/result/<result> or spiffe://tekton.dev/<taskrun>/<result> 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, we're referring to task results. Each result will have an associated signature, and all signatures can be verified with one SVID. These will be stored as annotations.

Are these SVIDs and signatures also stored as annotations in addition to the SVIDs and signature of the taskRun itself?

Yup, ultimately we will have signatures and an SVID for the Task results, and we'll also have a signature and an SVID over the entire TaskRun.

1. Tekton Pipelines receives a TaskRun config, and generates the Pod for it with SPIRE mounted in
1. The Pod executes, and the entrypointer requests an SVID and signature over the Results
1. Tekton Pipelines verifies the Results
1. Meanwhile, Tekton Pipelines has been verifying that the TaskRun hasn't been modified during execution
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please elaborate a little on TaskRun, I think TaskRun here means task specifications (including input params) either through taskRef or taskSpec 🤔 Or does it also include status since status is modified by the pipeline controller until the execution is over.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i basically meant it to include the entire TaskRun yaml that's being executed, and i'm guessing the status should be included as well.

@pritidesai
Copy link
Member

thanks @priyawadhwa for this proposal, excited to see this moving forward 🎉

please squash all the commits, add a little description in the PR, and update the last-updated in the TEP.

The proposal is very high level at this point like you mentioned. I would love to see more details when we move this TEP to implementable specifically adding examples and some visuals if possible 🙏

/approve

@tekton-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, jerop, pritidesai, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@priyawadhwa
Copy link
Contributor Author

Thanks @jerop and @pritidesai! I just squashed the commits and added in some more details based on your comments.

@bobcatfish
Copy link
Contributor

We've got approvals from all the approvers so I'm going to go ahead and lgtm this so it will be merged as proposed - @priyawadhwa the next step would be to open a PR to update the status to "implementable" - @pritidesai mentioned including some more examples and visuals as part of that PR as well (#529 (comment)) - maybe an architecture diagram and/or a sequence diagram?

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 18, 2022
@tekton-robot tekton-robot merged commit 7810fa6 into tektoncd:main Jan 18, 2022
@priyawadhwa priyawadhwa deleted the spire-support branch January 18, 2022 20:55
jerop added a commit to jerop/community that referenced this pull request Jan 27, 2022
This change fixes the date on TEP-0089 to pass the PipelineRun on
open TEPs.

The `pull-community-teps-lint` PipelineRun did not run when we
merged TEP-0089 in tektoncd#529.
It got merged with a linting issue - wrong date - that's causing
the same PipelineRun to fail in open TEPs.
@jerop jerop mentioned this pull request Jan 27, 2022
tekton-robot pushed a commit that referenced this pull request Jan 27, 2022
This change fixes the date on TEP-0089 to pass the PipelineRun on
open TEPs.

The `pull-community-teps-lint` PipelineRun did not run when we
merged TEP-0089 in #529.
It got merged with a linting issue - wrong date - that's causing
the same PipelineRun to fail in open TEPs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Opened
Development

Successfully merging this pull request may close these issues.

None yet