Use container registry for authenticating kubernetes image pull #3771


Open · wants to merge 16 commits into base: develop

Conversation

AlexejPenner
Contributor

Describe changes

Claude and I implemented the missing feature that allows ZenML to make the appropriate credentials available to the Kubernetes pods for image pulling.

I have validated that this works for my specific use case - I'm happy to iterate on this and to discuss decisions.

I am also not sure how this will affect legacy customers that might be setting image pull secrets explicitly.

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
  • IMPORTANT: I made sure that my changes are reflected properly in the following resources:
    • ZenML Docs
    • Dashboard: Needs to be communicated to the frontend team.
    • Templates: Might need adjustments (that are not reflected in the template tests) in case of non-breaking changes and deprecations.
    • Projects: Depending on the version dependencies, different projects might get affected.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

AlexejPenner requested a review from schustmi on June 24, 2025, 22:25
Contributor

coderabbitai bot commented Jun 24, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



github-actions bot added the labels internal (To filter out internal PRs and issues) and enhancement (New feature or request) on Jun 24, 2025
stefannica self-requested a review on June 25, 2025, 07:24
Contributor

stefannica left a comment

Loving this, because it (almost) hits all the right notes:

  • Docker credentials might be temporary, especially if they are coming from service connectors. Your solution always fetches the latest credentials and patches the existing secret instead of configuring the secret once and then forgetting about it. This is one of the challenges I feared we'd hit with this feature, and you nailed it.
  • You might have different credentials for different pipelines, so you're not attaching the credentials to the service account, you're attaching them directly to the pods. This offers nice isolation where secrets are only used by the pods that need them.
  • The secrets are only needed when the pod starts; after that you can forget about them. This is what's still missing: these secrets will pile up (each of them uses the pipeline run ID/name, right?) until you have thousands of secrets. You need to clean up the secrets that are no longer in use - e.g. the k8s orchestrator pod could clean up after itself (a rough sketch of that follows below).
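For illustration, a minimal cleanup sketch, assuming the orchestrator pod deletes the run-scoped secret once the run finishes (the helper name and error handling are assumptions, not actual ZenML code):

from kubernetes import client as k8s_client


def delete_image_pull_secret(
    core_api: k8s_client.CoreV1Api, namespace: str, secret_name: str
) -> None:
    """Delete the run-scoped image pull secret once no pod needs it anymore."""
    try:
        core_api.delete_namespaced_secret(name=secret_name, namespace=namespace)
    except k8s_client.rest.ApiException as e:
        if e.status != 404:  # 404 means the secret is already gone, nothing to do
            raise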

Aside from this, there are some design concerns that Claude didn't get right.

Comment on lines 53 to 62
if registry_uri in ("docker.io", "index.docker.io"):
    registry_server = "https://index.docker.io/v1/"
else:
    # Extract just the domain part for better image matching
    if registry_uri.startswith(("http://", "https://")):
        registry_server = registry_uri
    else:
        # For URIs like 'registry.onstackit.cloud/zenml', use just the domain
        domain = registry_uri.split("/")[0]
        registry_server = f"https://{domain}"
Contributor

Same here: this should be part of the base container registry class, maybe merged into _get_container_registry_credentials.

The rule of thumb is basically this:

  • if the code changes from one container registry flavor to another, it belongs in the container registry base class - or better yet, the base class plus implementations, if it's something that a subclass might override
  • if the code is specific to Kubernetes, it belongs here (a rough sketch of the split follows below)
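A rough sketch of that split, assuming a registry_server_uri property on the base class that flavors like DockerHub override (the plain uri attribute stands in for the registry's config and is an assumption, not ZenML's actual API):

class BaseContainerRegistry:
    def __init__(self, uri: str) -> None:
        # In ZenML the URI comes from the registry's config; a plain attribute
        # stands in for it in this sketch.
        self.uri = uri

    @property
    def registry_server_uri(self) -> str:
        """Derive the server URI used for Docker auth from the registry URI."""
        if self.uri.startswith(("http://", "https://")):
            return self.uri
        # For URIs like 'registry.onstackit.cloud/zenml', keep only the domain.
        return f"https://{self.uri.split('/')[0]}"


class DockerHubContainerRegistry(BaseContainerRegistry):
    @property
    def registry_server_uri(self) -> str:
        # DockerHub always authenticates against this fixed endpoint.
        return "https://index.docker.io/v1/"

This would keep the Kubernetes code free of any registry-flavor special-casing.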

- secret_manifests: List of Kubernetes secret manifests to create
- local_object_references: List of V1LocalObjectReference objects for imagePullSecrets
"""
credentials = _get_container_registry_credentials()
Contributor

There is a cleaner way to do this from a design perspective. Let me explain the thought process:

  • first, you have to ask yourself "what is the main area of concern for this utility Python module?". The answer is "building k8s manifests and nothing else". This should make it oblivious to things like stacks and ZenML Client(s). Basically, if your design is a tree of components linked by dependencies, this module should be a low-level leaf: it should know nothing about anything else in your design and therefore have no external dependencies (manifested here as imports). Or, if you use the layered design / onion analogy, this is an outermost layer with no dependencies.
  • next, you realize this new utility function needs container registry credentials. So how do you get them here?
    • option 1 (currently implemented): call out to the container registry and fetch the credentials. But this makes the module dependent on inner details like the Client or active stack, which spaghettifies the design and its dependency tree, and we want to avoid that.
    • option 2 (recommended): use dependency injection - have the inner components that use this utility module do the grunt work of extracting the container registry credentials and injecting them into this function (see the sketch after this list).
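As a sketch of option 2, the utility could accept the already-extracted credentials and only build manifests; the function and parameter names below are illustrative, not the actual ZenML utility API:

import base64
import json
from typing import List, Tuple

from kubernetes import client as k8s_client


def build_image_pull_secret_manifest(
    name: str,
    credentials: List[Tuple[str, str, str]],  # (registry_server, username, password)
) -> k8s_client.V1Secret:
    """Build a kubernetes.io/dockerconfigjson secret from injected credentials."""
    docker_config = {
        "auths": {
            server: {
                "username": username,
                "password": password,
                "auth": base64.b64encode(f"{username}:{password}".encode()).decode(),
            }
            for server, username, password in credentials
        }
    }
    return k8s_client.V1Secret(
        metadata=k8s_client.V1ObjectMeta(name=name),
        type="kubernetes.io/dockerconfigjson",
        string_data={".dockerconfigjson": json.dumps(docker_config)},
    )

The orchestrator, which already knows about the Client and the active stack, would extract the credentials from the container registry and pass them in, keeping this module a dependency-free leaf.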

Contributor Author

agreed 100%

Contributor Author

I found the local path function there as well; it violates these design principles, so I will be fixing that too.

@stefannica
Contributor

I am also not sure how this will affect legacy customers that might be setting image pull secrets explicitly

You need to do the research to figure this out.

Contributor

schustmi left a comment

On a more general note, your change means that the Kubernetes orchestrator and step operator are now broken for all users whose service account does not have permission to create secrets. This feature should probably be configurable; whether it's enabled by default is debatable, I think.
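One way to make this opt-in, sketched with a plain pydantic model standing in for the orchestrator settings (the field name mirrors the flag added in this PR, but where the flag lives and its default are assumptions):

from pydantic import BaseModel


class KubernetesOrchestratorSettings(BaseModel):
    # Disabled by default: no image pull secrets are created, so the service
    # account needs no extra permission to create or patch secrets.
    auto_generate_image_pull_secrets: bool = False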

@@ -108,7 +259,9 @@ def build_pod_manifest(
    env: Optional[Dict[str, str]] = None,
    mount_local_stores: bool = False,
    owner_references: Optional[List[k8s_client.V1OwnerReference]] = None,
    auto_generate_image_pull_secrets: bool = True,
) -> k8s_client.V1Pod:
Contributor

I don't think this is the right place to put this. The purpose of this function is to create a configurable pod manifest. In my opinion, you should be able to pass previously generated image pull secrets into this function; they should not be created as part of it.
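A sketch of the suggested shape, where the caller creates (or refreshes) the secrets first and only passes references in; the parameter list is trimmed down and illustrative, not ZenML's actual signature:

from typing import List, Optional

from kubernetes import client as k8s_client


def build_pod_manifest(
    pod_name: str,
    image_name: str,
    command: List[str],
    args: List[str],
    image_pull_secrets: Optional[List[k8s_client.V1LocalObjectReference]] = None,
) -> k8s_client.V1Pod:
    """Build a pod manifest; image pull secrets are created by the caller."""
    container = k8s_client.V1Container(
        name="main", image=image_name, command=command, args=args
    )
    pod_spec = k8s_client.V1PodSpec(
        containers=[container],
        image_pull_secrets=image_pull_secrets or [],
        restart_policy="Never",
    )
    return k8s_client.V1Pod(
        kind="Pod",
        api_version="v1",
        metadata=k8s_client.V1ObjectMeta(name=pod_name),
        spec=pod_spec,
    )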

@schustmi
Contributor

schustmi commented Jun 25, 2025

Adding on top of what @stefannica already mentioned: Those Docker credentials are short-lived and expire eventually. Currently, you're creating them client-side, which means they should be valid at least until the orchestrator pod spins up. However, let's say I have a sequential pipeline with three steps, each of them taking 20 hours to complete. When we get to running the third step, those credentials might already be expired. I think the solution for that would be to create a new (or refresh the existing) secret right before each step pod gets created.
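A minimal sketch of that refresh step, assuming the orchestrator rebuilds the secret manifest with freshly fetched credentials right before creating each step pod (the helper name and the 409 handling are assumptions, not ZenML's actual code):

from kubernetes import client as k8s_client


def refresh_image_pull_secret(
    core_api: k8s_client.CoreV1Api,
    namespace: str,
    secret_manifest: k8s_client.V1Secret,
) -> None:
    """Create the secret, or patch it if it already exists, so every step pod
    starts with credentials that were fetched moments earlier."""
    try:
        core_api.create_namespaced_secret(namespace=namespace, body=secret_manifest)
    except k8s_client.rest.ApiException as e:
        if e.status != 409:  # 409 = already exists; anything else is a real error
            raise
        core_api.patch_namespaced_secret(
            name=secret_manifest.metadata.name,
            namespace=namespace,
            body=secret_manifest,
        )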

@stefannica
Contributor

On a more general note, your change means that the kubernetes orchestrator and step operator are now broken for all users where the service account does not have permissions to create secrets. This feature should probably be configurable, whether its enabled by default is debatable I think.

@schustmi don't we already need permission to create secrets to store the ZenML API token as a secret?

@schustmi
Contributor

On a more general note, your change means that the kubernetes orchestrator and step operator are now broken for all users where the service account does not have permissions to create secrets. This feature should probably be configurable, whether its enabled by default is debatable I think.

@schustmi don't we already need permission to create secrets to store the ZenML API token as a secret ?

For the same reason as I mentioned here, I made this an optional feature and even disabled it by default. So for most people (who haven't explicitly enabled this), their service accounts do not require those permissions for now.

)
if secret:
    return secret.username, secret.password

Contributor Author

@stefannica according to your comment the connector should be checked first, but in the credentials implementation it was the other way around - let me know if I am missing something or if it made sense to flip here

Comment on lines +27 to +43
class DockerHubContainerRegistry(BaseContainerRegistry):
    """Container registry implementation for DockerHub."""

    @property
    def registry_server_uri(self) -> str:
        """Get the DockerHub registry server URI.

        DockerHub requires authentication against the specific
        'https://index.docker.io/v1/' endpoint regardless of how
        the registry URI is configured.

        Returns:
            The DockerHub registry server URI.
        """
        return "https://index.docker.io/v1/"


Contributor Author

I'm a bit unsure on this one as container registries seem like the ugly duckling among our stack components with very little actual implementation of the base classes.

@@ -339,6 +356,16 @@ def finalize_run(node_states: Dict[str, NodeStatus]) -> None:
    except k8s_client.rest.ApiException as e:
        logger.error(f"Error cleaning up secret {secret_name}: {e}")

    # Clean up old imagePullSecrets to prevent accumulation
    # Only clean up for non-scheduled runs to avoid interfering with running schedules
Contributor Author

@schustmi you mentioned that these are temporary - what would I need to do for scheduled pipelines?

AlexejPenner marked this pull request as ready for review on July 2, 2025, 09:22
@AlexejPenner
Contributor Author

I tested it on my use case and things worked. Once we both have some time, I would take you up on your offer to test together in more detail, @stefannica.

AlexejPenner requested a review from stefannica on July 2, 2025, 09:22
Labels: enhancement (New feature or request), internal (To filter out internal PRs and issues)
Projects: None yet
Participants: 3