ZEP-0033: registry proxy #37

Merged
AustinAbro321 merged 42 commits into main from host-network-support
Aug 4, 2025

@AustinAbro321 (Member) commented Jul 7, 2025

@AustinAbro321 changed the title from "Host network support" to "registry proxy initial proposal" on Jul 7, 2025
@AustinAbro321 changed the title from "registry proxy initial proposal" to "ZEP-0033: registry proxy" on Jul 8, 2025
@AustinAbro321 changed the title from "ZEP-0033: registry proxy" to "ZEP-0033: registry proxy initial proposal" on Jul 8, 2025
AustinAbro321 and others added 6 commits July 9, 2025 18:25
Comment thread 0032-support-ipv6/README.md Outdated
Comment thread 0033-registry-proxy/README.md Outdated

The proxy and the registry will connect over mTLS. Zarf will create a certificate authority along with a client and server certificate using the authority. These certificates will be automatically rotated during `zarf init` if they have less than half of their total duration remaining. Users will be able to specify their own certificates through flags on `zarf init`: `--registry-server-cert-file`, `--registry-server-key-file`, `--registry-client-key-file`, and `--registry-client-cert-file`.

A controller will run in the cluster that watches when the proxy daemonset fails to pull an image and spins up the injector when this happens. This is important because when a new node is added to the cluster the injector will not be present for the daemonset to bootstrap itself, and the proxy image will not be cached on the node.
Contributor:

A different idea: how about using an init container in the DaemonSet to handle the injector? That way you don't need a controller, and it works automatically on every node without extra work.

@AustinAbro321 (Member, Author):

Good idea, this would mean that the injector would now be long lived rather than short lived, but we'd need a long lived controller anyway. I'll have to make sure I'm not missing anything, but that does seem like a cleaner approach.

@AustinAbro321 (Member, Author):

I think the problem with an init container is that the regular container for the proxy won't start until the init container completes, and we'll have no way to trigger the pull of that image or to know when it's done. As you pointed out in a synchronous discussion, we could likely use a sidecar instead of a DaemonSet. I'll have to test that out, but that'd likely be better.

Contributor:

It looks like sidecars will most likely be the best solution, at least for now.

@AustinAbro321 (Member, Author):

For whatever reason, I could not get sidecar containers to work, so I currently have it as its own DaemonSet.

AustinAbro321 and others added 5 commits July 22, 2025 11:57
Co-authored-by: Maciej Szulik <maciej@defenseunicorns.com>
@AustinAbro321 marked this pull request as ready for review July 23, 2025 19:11
@AustinAbro321 changed the title from "ZEP-0033: registry proxy initial proposal" to "ZEP-0033: registry proxy" on Jul 23, 2025
@AustinAbro321 changed the title from "ZEP-0033: registry proxy" to "ZEP-0033: registry proxy" on Jul 23, 2025
Comment thread 0033-registry-proxy/README.md
Comment thread 0033-registry-proxy/README.md Outdated
The current NodePort service solution does not support IPv6 as IPv6 does not enable route_localnet which is required to call NodePort services using [::1] ([#90236](https://github.com/kubernetes/kubernetes/issues/90236#issuecomment-624721859)). There is a mandate ([wayback machine link because white house site is flaky ATM](https://web.archive.org/web/20250116092323/https://www.whitehouse.gov/wp-content/uploads/2020/11/M-21-07.pdf)) for government agencies to migrate to IPv6 single stack by end of fiscal year (FY) 2025. Given how often Zarf is used in government environments it's important IPv6 is enabled.
A similar issue also makes the Zarf registry unusable for distros such as OpenShift which blocks rewriting traffic to localhost. In both of these situations, hostPort will not work, however hostNetwork will.

The registry proxy solution comes with security advantages. The registry will only be accessible from within the cluster. This is an advantage over the current solution since NodePort services default to being accessible externally to anyone who can connect to a node. Additionally, we will force the registry to connect to the proxy and Zarf CLI with mTLS. With this approach, the only unencrypted traffic during a kubelet call occurs between the kubelet and proxy, ensuring this traffic never leaves the host. The Zarf CLI will connect directly to the registry over mTLS and Kubernetes port forwards.
@mkcp (Contributor), Aug 1, 2025:

> With this approach, the only unencrypted traffic during a kubelet call occurs between the kubelet and proxy, ensuring this traffic never leaves the host.

🔑

Comment thread 0033-registry-proxy/zep.yaml Outdated

As an administrator of a Kubernetes cluster who wants a greater security posture when using the Zarf registry, I run `zarf init --registry-proxy`.

### Risks and Mitigations
Contributor:

Good exploration here

The baseline [pod security standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/) recommend that pods not set hostPort or hostNetwork. Users with controllers that enforce these standards, such as Kyverno, will need to make an exemption. Additionally, some distros disable hostPort and hostNetwork by default, and users will need admin permissions to allow these features.
For example, OpenShift requires hostPort or hostNetwork pods to run with a privileged service account, while Talos requires the namespace to be privileged for hostPort to be enabled. For this feature to be considered stable, the Zarf documentation must include instructions for which settings to change to enable hostPort / hostNetwork on the most common Kubernetes distributions. Zarf currently has no distro-specific documentation but plans to add it (see [#3686](https://github.com/zarf-dev/zarf/issues/3686)).
Contributor:

> For this feature to be considered stable, the Zarf documentation must include instructions for which settings to change to enable hostPort / hostNetwork on the most common Kubernetes distributions

++

When we get here, I expect that taking a few different approaches will help users the most. A high-level networking architecture overview after Start Here would help orient users; then, under Best Practices or Reference, we can guide users through granular details such as distro-specific settings.

The registry is no longer accessible from outside of the cluster by default. Some users may rely on this, and will instead have to set up their own exposed service to connect to the registry.
Contributor:

Would it make sense to target example configs as part of GA?

@AustinAbro321 (Member, Author):

For exposing the registry, no imo. It'll be exposable through standard Kubernetes means, and I believe only a very small percentage of our user base will want to do this.

@brandtkeller (Member) left a comment:

Non-blocking comments / questions. There are a lot of benefits wrapped into this proposal.

A new `--registry-proxy` flag will be added to zarf init. Enabling this flag causes Zarf to create a DaemonSet running a proxy on each node that will connect directly to the registry service. Both the injector and proxy will require DaemonSets, and the injector will be long lived. Eventually, `--registry-proxy` may default to true.
Member:

The benefits of the injector being long-lived are good: it is more resilient to restarts of the registry.

> Both the injector and proxy will require DaemonSets, and the injector will be long lived

This makes sense given the current context, where the injector is temporary, but it could confuse some people since the proxy will be long-lived as well.

@AustinAbro321 (Member, Author):

It could, yeah. We'll definitely want docs explaining how and why the proxy process works.


A user can run `--registry-proxy` during `zarf init` and their choice will be saved to the cluster and used on subsequent runs during `init`. If a user wants to switch back to the localhost NodePort solution they must run `zarf init --registry-proxy=false`. If a user runs `zarf init` without the `--registry-proxy` flag on an already initialized cluster, Zarf will continue using the registry setup that was used during the initial init, whether that is the registry proxy or NodePort solution.
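The precedence described above (an explicit flag wins, otherwise the saved cluster choice, otherwise the current default) can be sketched in Go. This is only an illustration under assumptions: `resolveRegistryProxy` is a hypothetical function, and pointers stand in for "flag not passed" and "cluster not yet initialized". The fallback of `false` reflects the proposal's statement that the flag may only default to true eventually.

```go
package main

import "fmt"

// resolveRegistryProxy is a hypothetical helper, not part of Zarf.
// flag is nil when --registry-proxy was not passed on this run.
// saved is nil when the cluster has no stored choice (first init).
func resolveRegistryProxy(flag, saved *bool) bool {
	if flag != nil {
		return *flag // explicit user choice always wins
	}
	if saved != nil {
		return *saved // keep whatever the earlier init selected
	}
	return false // assumed current default: the NodePort solution
}

func main() {
	on, off := true, false
	fmt.Println(resolveRegistryProxy(nil, nil))  // first init, no flag
	fmt.Println(resolveRegistryProxy(nil, &on))  // re-init without the flag
	fmt.Println(resolveRegistryProxy(&off, &on)) // --registry-proxy=false
}
```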

The proxy and the registry will connect over mTLS. Zarf will create a certificate authority along with a client and server certificate using the authority. If a certificate has less than half of its total lifecycle remaining, then it will be rotated automatically during `zarf init`. Users will be able to specify their own certificates through flags on `zarf init`: `--registry-server-cert-file`, `--registry-server-key-file`, `--registry-client-key-file`, and `--registry-client-cert-file`.
Member:

Nit/Out-of-scope: Do we have any intention of supporting other services that want to access the registry, now over HTTPS, given that the TLS secrets are not accessible globally in the cluster?

@AustinAbro321 (Member, Author):

Not yet. For those use cases I'd rather change the allowed hostIP and have apps connect to the proxy. We will have to see how this works out as these use cases become more concrete.

There should be a test that verifies that `zarf init --registry-proxy` works with both NFTables and IPv6. These should both go through the entire e2e suite.
Member:

out-of-scope: would like to see some chaos engineering work its way into these tests - Purposely restart components of the init infrastructure and expect them to recover during package operations.

@AustinAbro321 (Member, Author):

Agreed, or other things like restarting / adding new nodes

@AustinAbro321 merged commit e599097 into main on Aug 4, 2025
1 check passed
@AustinAbro321 deleted the host-network-support branch August 4, 2025 18:08