Skip to content

ZEP-0035: Feature Flags#41

Merged
mkcp merged 15 commits intomainfrom
mkcp/00XX-feature-flags
Aug 1, 2025
Merged

ZEP-0035: Feature Flags#41
mkcp merged 15 commits intomainfrom
mkcp/00XX-feature-flags

Conversation

@mkcp
Copy link
Copy Markdown
Contributor

@mkcp mkcp commented Jul 15, 2025

  • One-line PR description:
    This ZEP proposes the need for, API, docs, and implementation of a Feature Flagging library in Zarf.

  • Issue link:
    Feature Flags #35

  • Other comments:
    Currently Revision 2, ready for pre-alpha review and potential merge.

mkcp added 2 commits June 30, 2025 11:57
Signed-off-by: Kit Patella <kit@defenseunicorns.com>
Signed-off-by: Kit Patella <kit@defenseunicorns.com>
@mkcp mkcp self-assigned this Jul 15, 2025
Signed-off-by: Kit Patella <kit@defenseunicorns.com>
@mkcp mkcp mentioned this pull request Jun 25, 2025
6 tasks
Signed-off-by: Kit Patella <kit@defenseunicorns.com>
@mkcp mkcp marked this pull request as draft July 15, 2025 18:25
Comment thread 0035-feature-flags/README.md Outdated
Comment thread 0035-feature-flags/README.md Outdated
- Rollout considerations, like can we _backport_ flags to previous features so that they can be associated with verions
or disabled?

## FIXME Design Details
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One important note with Zarf is that anything in the schema will span to both create time and deploy time. If a feature is enabled and used at create time, do we assume that it's enabled for the deployer. I'd lean towards yes, but curious on others thoughts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, yeah. This is maybe a hacky solution, but what if we retrieved the set ofAll() enabled features when creating a package and serialized their names to a build or metadata field?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see that being useful in the event we are changing how a specific feature in the schema works (though maybe if we're doing that it's a sign we should be using a different key).

I was thinking of an even simpler case where we have a gated feature in the schema and someone opts into it, would that feature also be gated on deploy?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will depend on whether a particular functionality alters the package or its deployment or not. But storing that information in package metadata is definitely good idea.

Comment thread 0035-feature-flags/README.md Outdated
// if len(flags) == 0 {...}
```

### TODO DISCUSS: Global API
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer global. From what I can tell it looks like Kubernetes declares their feature flags globally on init() https://github.com/kubernetes/kubernetes/blob/c4beea784c4ba2ab4cdf98d13dfe094fcb746288/pkg/features/kube_features.go#L1945

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm supportive.

Comment thread 0035-feature-flags/README.md Outdated
StageDeprecated Stage = "deprecated"
)

type Flag struct {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering changing this to Feature to make it a bit more clear what it's describing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migrating Flag type to Feature in latest

Comment thread 0035-feature-flags/README.md Outdated
### Types
Below is the proposed data model for Feature Flags, including the `Flag` type and supporting fields in the `flag` pkg.
```
pkg flag
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the type name below, considering changing the pkg name to feature to make it a bit more clear what it's describing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MIgrating package flag -> feature in latest.

Signed-off-by: Kit Patella <kit@defenseunicorns.com>
Copy link
Copy Markdown
Member

@brandtkeller brandtkeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Processing my understanding - appreciative of the huge scope this is appropriately engaging.

Comment thread 0035-feature-flags/README.md Outdated
of release and deprecation, eg. "alpha", "beta", "GA", "deprecated". It introduces a ctx-based API, similar to previous
usages of feature flags in Zarf, a centralized location where all CLI feature flags are **declared** so that users may
easily find Flags and their documentation in code. Additionally, centralizing where Flags are declared allows us to
automate site documentation that corresponds one to one with the docs in code. Inspiration for the Flag fields is taken
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with intent - curious if there is an overlay of alternative designs that uses our already-available utilization of cobra/viper/pflag . Are there constraints present here with the lifecycle exposed by those and our current use? maybe.

pflag has a lot in the way of lifecycle management and the integration with cobra/viper already answers questions for how we consume from config / generate documentation and manage the lifecycle of active/hidden/deprecated flags.

Just want to ensure we weigh that optionality before understanding if we require a bespoke implementation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the plag will only to handle the flag parsing part of things. This document focuses primarily on the managing the feature gates, enabled through flags. For example, given the following 4 features at various stages:

  1. featureA - stable, enabled by default, can't be disabled.
  2. featureB - beta, enabled by default, can be disabled.
  3. featureC - beta, disabled by default, can be enabled.
  4. featureD - alpha, disabled by default, can be enabled.

A zarf user, by default, has access to featureA and featureB (on by default). If a user wants to enable featureC, one needs to run zarf --feature-gate=featureC=true, if one wants both featureC and featureD - zarf --feature-gate=featureC=true,featureD=true. Alternatively if user wants to disable a feature that is on by default - zarf --feature-gate=featureB=false.

From a developer perspective you have 2 helpful utilities:

  1. A registry of feature gates, ideally with information about its state (alpha, beta, stable), whether it's on or off by default, and lastly if it can be turned off.
  2. A helper method which allows you to check if featureA is on or off, that you then use in your code base.

An example from k8s:

  1. Feature metadata: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/pkg/features/kube_features.go#L405-L410 with a pointer to KEP, and history https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/pkg/features/kube_features.go#L1315-L1319
  2. Sample usage: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/pkg/apis/batch/v1/defaults.go#L62

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, kubectl uses env vars to achieve this:

  1. Feature metadata: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/staging/src/k8s.io/kubectl/pkg/cmd/util/helpers.go#L434-L438
  2. Sample usage: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/staging/src/k8s.io/kubectl/pkg/cmd/cmd.go#L362-L364

Although, I'm planning to simplify the interface for kubectl a bit, since it's a bit clunky atm. The goal is to make it closer to what --feature-gates do.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we could just re-use the k8s library for feature gates, rather than rolling yet another one https://github.com/kubernetes/component-base/blob/master/featuregate/feature_gate.go. Or use some other library that I'm pretty exist 😅

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seconding what @soltysh is saying here where pflag works great as a kind of CLI frontend for sure, but doesn't provide a flexible enough model or API for features. Especially when it comes to users needing to query or change feature state within code.

Regarding borrowing or reusing the K8s feature-gates - I definitely took a lot of inspiration from the implementation! Reading through the type was my starting point. K8s runtime mutable features which are compelling, as well as multiple features per gate. Maybe more than we need pre-1.0 though?

If are effective with the API and configuration layer I do think we can come back to the implementation and add functionality in or swap out for more if we need it in the future. The config layer (CLI flags, env vars, zarf-config.{yaml,toml}) already does this well imo. Where we'll need to be careful are the abstractions we offer to users for the SDK.


Feature flags are implemented on an opt-in basis. Upgrading to versions with feature flagging supported, and with
toggle-able feature flags ought to be simple and transparent. There are no backwards compatibility concerns for previous
versions of Zarf as feature flags deal with the code itself.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no backwards compatibility concerns for previous versions of Zarf as feature flags deal with the code itself.

Does this consider the impact of the package created and potential drift between the zarf version at creation and the version used during deploy?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread 0035-feature-flags/README.md Outdated
- Also if maintainers rely on enabling feature flags at specific versions then this keeps SDK to CLI parity.
- Without global defaults we have to flip the boolean from enabled to disabled when providing a beta feature by default.
- A major drawback with ctx is that it's difficult to inspect for users. "where did this flag come from?"
- Maybe this is a good nudge to get features into GA sooner. More to discuss here.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ It is a great reason for us to have a process that tracks current state and what we require for graduation?

Comment thread 0035-feature-flags/README.md Outdated
-->

- One critical pitfall to avoid with feature flags is creating _unexpected_ behavior, both for end users and maintainers.
Unclear state about which flags are enabled and not, and with and without defaults. We have some solutions below under
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appreciate this - how can we make the propagation throughout the system more observable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that w/ the addition of separate sets for default and user features that sourcing these gets even easier.

Comment thread 0035-feature-flags/README.md Outdated
of release and deprecation, eg. "alpha", "beta", "GA", "deprecated". It introduces a ctx-based API, similar to previous
usages of feature flags in Zarf, a centralized location where all CLI feature flags are **declared** so that users may
easily find Flags and their documentation in code. Additionally, centralizing where Flags are declared allows us to
automate site documentation that corresponds one to one with the docs in code. Inspiration for the Flag fields is taken
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the plag will only to handle the flag parsing part of things. This document focuses primarily on the managing the feature gates, enabled through flags. For example, given the following 4 features at various stages:

  1. featureA - stable, enabled by default, can't be disabled.
  2. featureB - beta, enabled by default, can be disabled.
  3. featureC - beta, disabled by default, can be enabled.
  4. featureD - alpha, disabled by default, can be enabled.

A zarf user, by default, has access to featureA and featureB (on by default). If a user wants to enable featureC, one needs to run zarf --feature-gate=featureC=true, if one wants both featureC and featureD - zarf --feature-gate=featureC=true,featureD=true. Alternatively if user wants to disable a feature that is on by default - zarf --feature-gate=featureB=false.

From a developer perspective you have 2 helpful utilities:

  1. A registry of feature gates, ideally with information about its state (alpha, beta, stable), whether it's on or off by default, and lastly if it can be turned off.
  2. A helper method which allows you to check if featureA is on or off, that you then use in your code base.

An example from k8s:

  1. Feature metadata: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/pkg/features/kube_features.go#L405-L410 with a pointer to KEP, and history https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/pkg/features/kube_features.go#L1315-L1319
  2. Sample usage: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/pkg/apis/batch/v1/defaults.go#L62

Comment thread 0035-feature-flags/README.md Outdated
of release and deprecation, eg. "alpha", "beta", "GA", "deprecated". It introduces a ctx-based API, similar to previous
usages of feature flags in Zarf, a centralized location where all CLI feature flags are **declared** so that users may
easily find Flags and their documentation in code. Additionally, centralizing where Flags are declared allows us to
automate site documentation that corresponds one to one with the docs in code. Inspiration for the Flag fields is taken
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, kubectl uses env vars to achieve this:

  1. Feature metadata: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/staging/src/k8s.io/kubectl/pkg/cmd/util/helpers.go#L434-L438
  2. Sample usage: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/staging/src/k8s.io/kubectl/pkg/cmd/cmd.go#L362-L364

Although, I'm planning to simplify the interface for kubectl a bit, since it's a bit clunky atm. The goal is to make it closer to what --feature-gates do.

Comment thread 0035-feature-flags/README.md Outdated
### Non-Goals

- Elaborate configuration nightmares, where users don't know what is enabled or multiple flags are intertwined.
- Forever features that are endless reworked and never make it to GA.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to move this to goals, to ensure we don't have forever features. Ideally you want to enforce some kind of timeline for every feature, but if you see something is stuck for too long you might want to consider to drop it, which is also easier depending on the stage the features is at. Alpha features can be dropped almost at a whim, whereas beta (especially those enabled by default) or stable require a proper deprecation period to allow users to stop using those.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I see what you mean, having features while models all of this explicitly does help with the problem. And the problem exists whether it's modeled or not.

Comment thread 0035-feature-flags/README.md Outdated
- Rollout considerations, like can we _backport_ flags to previous features so that they can be associated with verions
or disabled?

## FIXME Design Details
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will depend on whether a particular functionality alters the package or its deployment or not. But storing that information in package metadata is definitely good idea.

Comment thread 0035-feature-flags/README.md Outdated
# Enabled is set if a feature is set.
Enabled bool
# Default is set if a feature is enabled by default, without being set by users.
Default bool
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest not to store Enabled next to Default, but rather keep this structure minimal such that it will only hold information about feature gate, such as: name, description, version, enabled, stage. Separately, you'll want to keep a registry of currently enabled flags (those with enabled=true out of the box and those enabled by users explicitly through flags), and devs should always use the registry helper to know if it's on or off.

This separation ensures we don't accidentally enable something that should not be enabled.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, and the default vs. enabled states were intended to allow for this behavior in the same set of features. Thinking through this more though, there's another complication if a feature is disabled by default then enabled by users (default=true, enabled=false) -> (default=true, enabled=true) - that effectively overwrites the flag and there's no way for us to distinguish during runtime that it was changed.

I see the advantages of removing the "default" field and instead having two different sets of features. Having a simpler API starts to outweigh the downsides of some more impl. complexity. Namely, it allows us to store multiple copies of flags and see if the state has been changed by users. e.g. psuedocode:

default={Foo=false, bar=...}
enabled={Foo=true}

This could be accomplished with another field on the struct, e.g. modified and constructors (e.g. feature.NewDefault, feature.NewEnabled, etc.) but I'd also lean towards keeping the struct minimal to keep the API simple and "just" use sets instead of boolean logic.

I'll update the API some so it can filter for default or not rather than passing the fields back and having the user do so.

Comment thread 0035-feature-flags/README.md Outdated
- Make new features opt-in.
- Have a clear and documented reference for users on which versions of Zarf introduce new features.
- Runtime-level feature configuration
- Stretch goal: multiple avenues of configuration (code, env vars, flags, config files)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially, you could also have a better way to inform users about deprecated features, if they are using those.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a benefit, going to add it to goals.

Comment thread 0035-feature-flags/README.md Outdated
// if len(flags) == 0 {...}
```

### TODO DISCUSS: Global API
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm supportive.


Feature flags are implemented on an opt-in basis. Upgrading to versions with feature flagging supported, and with
toggle-able feature flags ought to be simple and transparent. There are no backwards compatibility concerns for previous
versions of Zarf as feature flags deal with the code itself.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mkcp added 3 commits July 29, 2025 10:01
Signed-off-by: Kit Patella <kit@defenseunicorns.com>
Signed-off-by: Kit Patella <kit@defenseunicorns.com>
Signed-off-by: Kit Patella <kit@defenseunicorns.com>
@mkcp mkcp marked this pull request as ready for review July 31, 2025 16:55
@mkcp
Copy link
Copy Markdown
Contributor Author

mkcp commented Jul 31, 2025

Revision 2 of the proposal is done which instead proposes globally stored features and updates the API accordingly. I've provided some examples of how this is intended to work, but if there's any gaps please let me know!

I believe this is ready to merge given final draft review and alpha impl is ready to begin. There's still some TODOs left in that I'd like to come back to after implementing alpha. Nitpicks absolutely welcome!

@mkcp mkcp added this to Zarf Jul 31, 2025
@mkcp mkcp moved this to PR Review in Zarf Jul 31, 2025
mkcp added 5 commits July 31, 2025 12:32
Signed-off-by: Kit Patella <kit@defenseunicorns.com>
…eads better

Signed-off-by: Kit Patella <kit@defenseunicorns.com>
Signed-off-by: Kit Patella <kit@defenseunicorns.com>
Signed-off-by: Kit Patella <kit@defenseunicorns.com>
…mments to slashes

Signed-off-by: Kit Patella <kit@defenseunicorns.com>
Copy link
Copy Markdown
Member

@AustinAbro321 AustinAbro321 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work so far. Clarifying question on CLI UX

Comment thread 0035-feature-flags/README.md Outdated
// My beautiful Zarf-enabled application
```

### CLI: Enabling and Disabling Flags
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious on how this would look for something like --registry-proxy in #37. I think that'd be a good candidate for feature flag.

I would expect that if users are required to run zarf init --feature-enable=registry-proxy --registry-proxy, just running zarf init --feature-enable=registry-proxy wouldn't actually do anything in that case. Also if someone runs zarf init -h before registry-proxy is enabled should they see the --registry-proxy flag in help text?

Copy link
Copy Markdown
Contributor Author

@mkcp mkcp Aug 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first form that you mentioned is also what I'd expect, that way the feature itself can be enabled but not actually run. Users could also enable the feature through ZARF_FEATURE_ENABLE="registry-proxy" zarf init --registry-proxy or through on-disk config so the main idea is keeping parity. Given that registry-proxy is an option, flagging it enables that option.

As for help text, it's a good callout. In a quick experiment, I wasn't able to get the boolean state of one rootCmd pflag to mark another as hidden (e.g. if IsColorDisabled { ...MarkHidden("--log-level") }. There may be another path though.

That said, the least clever path would be showing the flag text every time and documenting its release stage and which feature flag to use.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah worth experimenting more on that in the future, but marking it as disabled in the flag text as a backup is reasonable for now.

mkcp added 2 commits August 1, 2025 09:11
Signed-off-by: Kit Patella <kit@defenseunicorns.com>
Signed-off-by: Kit Patella <kit@defenseunicorns.com>
Copy link
Copy Markdown
Member

@AustinAbro321 AustinAbro321 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few todos still, but this is certainly in a state where we can start iterating on the implementation. Nice work.

@mkcp mkcp merged commit abd44b4 into main Aug 1, 2025
1 check passed
@mkcp mkcp deleted the mkcp/00XX-feature-flags branch August 1, 2025 19:39
@github-project-automation github-project-automation Bot moved this from PR Review to Done in Zarf Aug 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants