
Refactor cosigned to take advantage of duck typing. #637

Merged: 3 commits, Sep 11, 2021

Conversation

mattmoor (Member) commented Sep 8, 2021

With this change, the webhook can take advantage of duck typing to parse all of the "Pod Specable" types currently supported.

This also takes advantage of the knative.dev/pkg webhook infrastructure to reduce boilerplate and eliminate the need for cert-manager.

Lastly, this starts to sketch out some cosigned e2e tests to verify that things work.

Signed-off-by: Matt Moore mattomata@gmail.com

mattmoor (Member Author) commented Sep 8, 2021

TODO list for myself:

  • Change the exclude label to an include label, and label the namespace for tests.
  • Add coverage for a PodSpecable in addition to Pod (given ./cmd/sample, maybe Job?)
  • Split off the :nonroot base image as separate PR.
  • Split off the fulcioroot package split as separate PR.

@mattmoor mattmoor force-pushed the knative-pkg branch 2 times, most recently from 345ec18 to 9da1f06 Compare September 8, 2021 18:28
mattmoor (Member Author) commented Sep 8, 2021

Ok, added a Job test, and broke off: #638 and #639 (will rebase this when those merge)

@mattmoor mattmoor force-pushed the knative-pkg branch 2 times, most recently from 9d14f2e to a2953f9 Compare September 8, 2021 20:10
@mattmoor mattmoor changed the title from "[WIP] Refactor cosigned to take advantage of duck typing." to "Refactor cosigned to take advantage of duck typing." Sep 8, 2021
dlorenc (Member) commented Sep 8, 2021

Cc @hectorj2f FYI

This looks good to me, but let me know if you have any concerns

package main

import "log"

func main() {
	log.Printf("Hello, World!")
}
Contributor:

This looks like an unnecessary piece of code to maintain. Can you explain why you added it?

mattmoor (Member Author):

An image is published with this code to a registry running alongside KinD in the e2e tests. We use this image to verify that:

  1. Unsigned images are fine in namespaces not subject to the webhook,
  2. Unsigned images are rejected in namespaces subject to the webhook,
  3. Signed images are fine in namespaces subject to the webhook.

Personally, I'm less worried about maintaining "Hello world" than I am about maintaining a published image somewhere and logic to copy it down.
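In shell, the three checks could be sketched roughly like this. This is a hypothetical outline, not the PR's actual script: the namespace names, the include label, and the image references are all illustrative.

```shell
# Hypothetical sketch of the three webhook checks; names are illustrative.
# 1. Unsigned image in a namespace NOT subject to the webhook: should succeed.
kubectl create ns unwatched
kubectl create job demo -n unwatched --image="${REGISTRY}/sample:unsigned"

# 2. Unsigned image in a namespace subject to the webhook: should be rejected.
kubectl create ns watched
kubectl label ns watched cosigned.sigstore.dev/include=true   # illustrative label
if kubectl create job demo -n watched --image="${REGISTRY}/sample:unsigned"; then
  echo "ERROR: unsigned image was admitted" && exit 1
fi

# 3. Signed image in the same namespace: should succeed.
cosign sign --key cosign.key "${REGISTRY}/sample:signed"
kubectl create job demo-signed -n watched --image="${REGISTRY}/sample:signed"
```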

Contributor:

> Unsigned images are fine in namespaces not subject to the webhook,
> Unsigned images are rejected in namespaces subject to the webhook,
> Signed images are fine in namespaces subject to the webhook.

Couldn't we build our own cosign image and publish it, unsigned, to this registry running alongside the cluster? Then we could test using our own docker image.

On the other hand, if this code is only used for testing purposes, would it be a better choice to move it to the test directory? wdyt?

mattmoor (Member Author):

Yeah, that's what this does: ko publishes this test image to the local registry, so we don't need to rely on a prebuilt image. I'd be happy to move this to a more clearly named directory, I just didn't see a precedent to follow yet.


- name: Collect diagnostics
if: ${{ failure() }}
run: |
Contributor:

Could we use scripts (potentially stored in a hack directory) instead of this long yaml file?

mattmoor (Member Author):

The point of this logic is to dump useful diagnostic information from the ephemeral cluster before it is torn down. If we find ourselves creating more workflows that copy/paste this (within the same repo) then it might be useful to extract, but I don't think it's useful for local environments because they stick around for post-mortem inspection.

Contributor:

Perhaps a simple solution would have been to use kind export logs to dump the logs to a directory, so they could be made available as a downloadable bundle.

mattmoor (Member Author):

We can certainly change this to do that. This snippet is just one I've been using across a variety of projects for this, and it's nice to be able to jump directly to the pod logs for the component you care about.

In Knative we uploaded the full logs to GCS, and they got very very large. Not sure this will end up with as much logging or as many tests, but we can tweak this however we want 👍
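For reference, the kind export logs approach mentioned above could look roughly like this as GitHub Actions steps (a sketch; the step names, directory, and artifact name are illustrative):

```yaml
# Sketch: export KinD logs on failure and attach them as a workflow artifact.
- name: Export KinD logs
  if: ${{ failure() }}
  run: |
    kind export logs ./kind-logs

- name: Upload log bundle
  if: ${{ failure() }}
  uses: actions/upload-artifact@v2
  with:
    name: kind-logs
    path: ./kind-logs
```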

# Wait for the webhook to come up and become Ready
kubectl rollout status --timeout 5m --namespace cosign-system deployments/webhook

- name: Run Tests
Contributor:

I would prefer using ginkgo e2e tests instead of this raw script. It is easier to maintain and less error-prone. We were planning to add e2e tests next week too.

Contributor:

Again, instead of having the whole piece of code here, it would be easier to maintain if we called scripts instead.

mattmoor (Member Author):

I created ./test/e2e_test_cosigned.sh with this for now, which mostly just outlines what was here.

My goal here wasn't to set a precedent for how we build all future tests, but to at least get a workflow set up which validates something presubmit, so we don't end up with another -log_dir situation.

Suffice it to say that I'd be happy to see all of the "tests" rewritten in something better, but wanted some measure of validation before checking this in. Rather than bikeshed on which test framework to rewrite the tests in here, my inclination would be to get some test coverage in, and then we can follow up.

Contributor:

Sounds good to me.

@@ -0,0 +1 @@
../../../../LICENSE
Contributor:

Question: why do we need this here?

mattmoor (Member Author):

HEAD and refs are symlinked in so that knative.dev/pkg/changeset can infuse various things (e.g. structured logging) with the changeset at which the image was built.

This is here as part of license compliance. Really this repo should also start running github.com/google/go-licenses as well to produce a third_party/VENDOR-LICENSE/... that we symlink here as well for license compliance, but that is sufficiently beyond the scope of my change that I left it out.

Contributor:

Could we add these kodata files to .gitignore?

mattmoor (Member Author):

They aren't generated, and they are generally checked in because you want the published images to contain the metadata. If there were a way of generating these, we could conceivably use .gitignore, but ko isn't going to create them (kodata is ko's convention for bundling data into images, but ko doesn't populate it).
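For illustration, the kodata convention looks roughly like this (paths assume a command directory four levels below the repo root, matching the symlink in the diff; ko follows these symlinks when bundling the files into the image at $KO_DATA_PATH):

```shell
# Create the kodata directory and symlink repo-level files into it.
# ko copies the resolved contents into the image; nothing here is generated,
# which is why the links are checked in rather than gitignored.
mkdir -p cmd/cosign/webhook/kodata
ln -sfn ../../../../LICENSE   cmd/cosign/webhook/kodata/LICENSE
ln -sfn ../../../../.git/HEAD cmd/cosign/webhook/kodata/HEAD
ln -sfn ../../../../.git/refs cmd/cosign/webhook/kodata/refs

# The symlinks store relative targets:
readlink cmd/cosign/webhook/kodata/HEAD   # → ../../../../.git/HEAD
```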

func main() {
ctx := webhook.WithOptions(signals.NewContext(), webhook.Options{
ServiceName: "webhook",
Port: 8443,
Contributor:

Port should remain configurable.
Metrics port and other settings would need to be configurable as well.

mattmoor (Member Author):

The metrics port (if you are using prometheus) is configurable through config-observability: https://github.com/knative/pkg/blob/9a4b6128207c17418c4524f9f9a07cd9cb3babda/metrics/config.go#L101-L104

IIRC 9090 is the typical prometheus port, so IDK where 8080 came from. Hosting metrics on a port is also kind of an anti-pattern because it doesn't work well in ephemeral environments (a la serverless), so other metrics options (like stackdriver) push metrics and don't expose a port.

mattmoor (Member Author):

I take that back. The configmap controls a nested configuration. The prometheus port is configurable through environment variables: https://github.com/knative/pkg/blob/50410e0b833abc1a464d334b9e52e2c873dafcd4/metrics/config.go#L55-L56
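For illustration, the environment-variable route might look like this on the webhook container. The variable name is taken from my reading of the linked config.go; verify METRICS_PROMETHEUS_PORT against the vendored knative.dev/pkg version, and treat the METRICS_DOMAIN value as illustrative.

```yaml
# Sketch: override the prometheus metrics port via env vars on the
# webhook container (names assumed from knative.dev/pkg metrics config).
env:
  - name: METRICS_PROMETHEUS_PORT
    value: "9090"
  - name: METRICS_DOMAIN
    value: sigstore.dev/cosigned   # illustrative domain
```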

ctx := webhook.WithOptions(signals.NewContext(), webhook.Options{
ServiceName: "webhook",
Port: 8443,
SecretName: "webhook-certs",
Contributor:

Shouldn't it be the value of secretName?

mattmoor (Member Author):

No, secretName holds the name of the signing key. This is the secret that holds the TLS cert, which we reconcile with certificates.NewController, and which the sharedmain logic uses to host a TLS-terminated endpoint.

Contributor:

Consider updating the description on L39 and expanding which secret it is, something like:

The name of the secret in the webhook's namespace that holds the TLS certificate used for signing

Or maybe change the flag from secret-name to signing-secret-name or just signing-secret?

mattmoor (Member Author):

Updated the description, this is for verification (vs. signing). I was thinking earlier that we might be able to just use a configmap for this since it's mainly intended for the public key (vs. proper "secret" data), but that's beyond the scope of the change.


// Add healthz and readyz handlers to webhook server. The controller-runtime AddHealthzCheck/AddReadyzCheck methods
// are served via separate http server - better to serve these from the same webhook http server.
webhookServer.WebhookMux.Handle("/readyz/", http.StripPrefix("/readyz/", &healthz.Handler{}))
Contributor:

Question: are we losing the healthz and readyz checks with these changes?

mattmoor (Member Author):

No, the Knative webhook logic has relatively sophisticated logic to respond to probes. You can see these configured in config/webhook.yaml.

# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v1
Contributor:

We decided to move all the installation scripts into our Helm charts, as done in #609. We have a PR open on the sigstore/helm-charts repo.

mattmoor (Member Author):

Yes, I found that very confusing.

What's the plan to write e2e tests to validate changes presubmit? How do you validate changes presubmit that need both code and config changes (this PR itself is a somewhat dramatic example of that)?

Personally, I think it's fine to have Helm-specific stuff separate, and we can rationalize that when this lands (it still hadn't landed when I looked yesterday). My $0.02 is that Helm shouldn't be a requirement to install this, and the "lowest common denominator" is yaml configs, which is effectively all this is (ko resolve -f config > release.yaml is how quite a few projects produce rendered release yaml).
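The plain-yaml flow described above amounts to just (a sketch, assuming ko is installed, KO_DOCKER_REPO is set, and the manifests live under config/):

```shell
# Render the config/ manifests, with ko building and substituting
# image references; the result is an installable release yaml.
ko resolve -f config/ > release.yaml

# Consumers can then install without Helm or ko:
kubectl apply -f release.yaml
```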

verbs: ["create"]

# Allow the reconciliation of exactly our validating webhook.
- apiGroups: ["admissionregistration.k8s.io"]
Contributor:

Why do we need all these permissions? In the original chart, we don't need more than those: https://github.com/sigstore/helm-charts/pull/10/files#diff-dba06e52f4da92f91dbb6c70da49d6d5c18f8822eb0516d76fba88c436689a0eR17.

mattmoor (Member Author):

The original code also needed cert-manager, and these permissions are actually extremely narrow if you pick through them.

We reconcile the validating webhook to make sure it has the appropriate caBundle, which is important for certificate rotation. If you look closely, the only webhook this can update is the cosigned.sigstore.dev webhook.

We support event creation (above this) at the cluster scope because we reconcile cluster-scoped resources (the webhook), and need to support attaching events to it when we fail to reconcile it.

The block below this is so the webhook can fetch the UID of the "system" namespace, and use this to create an OwnerReference link between the validating webhook and the namespace. A fairly common "uninstall" failure mode is that folks just YOLO delete the "system" namespace, which can be disastrous if cluster-scoped resources (not within the namespace) are left around. In the case of webhooks, there would remain a webhook, which will 503 indefinitely for any resources within its purview. With this capability, the webhook is cleaned up as soon as the namespace is deleted.
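The OwnerReference wiring described above would render roughly like this on the reconciled webhook object (a hypothetical sketch; the uid placeholder is filled in at reconcile time from the fetched namespace):

```yaml
# Sketch: the reconciled ValidatingWebhookConfiguration carries an
# ownerReference to the system namespace, so deleting that namespace
# garbage-collects the webhook instead of leaving it to 503 forever.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: cosigned.sigstore.dev
  ownerReferences:
    - apiVersion: v1
      kind: Namespace
      name: cosign-system     # the "system" namespace
      uid: <namespace-uid>    # fetched at reconcile time
```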

Contributor:

@mattmoor I was referring to all the yaml files in general. We decided to move the yaml manifests to the helm-charts repo.
cc @cpanato @dlorenc

mattmoor (Member Author):

I'm happy to help rationalize things here, I just hadn't seen the "dots" connected for how y'all plan to do e2e testing with the configuration in a separate repo. I really wanted e2e tests to help validate things here.

One thing I've done elsewhere is vendor the config files across repos, which is another possible option.

mattmoor (Member Author) left a comment:

Thanks for the comments @hectorj2f.

I pushed a couple changes in a second commit, but may amend that commit as needed if I broke something doing it.

Comment on lines +74 to +84
readinessProbe: &probe
failureThreshold: 6
initialDelaySeconds: 20
periodSeconds: 1
httpGet:
scheme: HTTPS
port: 8443
httpHeaders:
- name: k-kubelet-probe
value: "webhook"
livenessProbe: *probe
mattmoor (Member Author):

@hectorj2f this is the readiness probe configuration, which validates that the webhook is still serving. The underlying webhook code has logic built in to start failing readiness probes on SIGTERM, but continues to serve traffic until a duration has elapsed without receiving any traffic, or SIGKILL (obvs).

Signed-off-by: Matt Moore <mattomata@gmail.com>
vaikas (Contributor) left a comment:

just a couple of nits

Resolved review threads:

  • .github/workflows/kind-e2e-cosigned.yaml
  • cmd/cosign/webhook/main.go
  • config/500-webhook-configuration.yaml
  • pkg/cosign/kubernetes/webhook/validator.go

@cpanato cpanato added this to the v1.2.0 milestone Sep 9, 2021
…oist and comment webhook name as constant

Signed-off-by: Matt Moore <mattomata@gmail.com>
hectorj2f (Contributor):

@mattmoor I will have another look and manually test these changes tomorrow.

@dlorenc dlorenc merged commit fb04df8 into sigstore:main Sep 11, 2021
@mattmoor mattmoor deleted the knative-pkg branch September 13, 2021 17:59