ZEP-0035: Feature Flags by mkcp · Pull Request #41 · zarf-dev/proposals

mkcp · 2025-07-15T16:06:04Z

One-line PR description:
This ZEP proposes the need for, API, docs, and implementation of a Feature Flagging library in Zarf.
Issue link:
Feature Flags #35
Other comments:
Currently Revision 2, ready for pre-alpha review and potential merge.

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

AustinAbro321 · 2025-07-16T14:20:43Z

+- Rollout considerations, like can we _backport_ flags to previous features so that they can be associated with verions
+or disabled? 
+
+## FIXME Design Details


One important note with Zarf is that anything in the schema will span to both create time and deploy time. If a feature is enabled and used at create time, do we assume that it's enabled for the deployer. I'd lean towards yes, but curious on others thoughts.

That's a good point, yeah. This is maybe a hacky solution, but what if we retrieved the set ofAll() enabled features when creating a package and serialized their names to a build or metadata field?

I could see that being useful in the event we are changing how a specific feature in the schema works (though maybe if we're doing that it's a sign we should be using a different key).

I was thinking of an even simpler case where we have a gated feature in the schema and someone opts into it, would that feature also be gated on deploy?

I believe this will depend on whether a particular functionality alters the package or its deployment or not. But storing that information in package metadata is definitely good idea.

AustinAbro321 · 2025-07-16T14:21:15Z

+// if len(flags) == 0 {...}
+```
+
+### TODO DISCUSS: Global API


I'd prefer global. From what I can tell it looks like Kubernetes declares their feature flags globally on init() https://github.com/kubernetes/kubernetes/blob/c4beea784c4ba2ab4cdf98d13dfe094fcb746288/pkg/features/kube_features.go#L1945

I'm supportive.

mkcp · 2025-07-17T17:35:38Z

+  StageDeprecated Stage = "deprecated"
+)
+
+type Flag struct {


Considering changing this to Feature to make it a bit more clear what it's describing.

Migrating Flag type to Feature in latest

mkcp · 2025-07-17T17:35:56Z

+### Types
+Below is the proposed data model for Feature Flags, including the `Flag` type and supporting fields in the `flag` pkg.
+```
+pkg flag 


Same as the type name below, considering changing the pkg name to feature to make it a bit more clear what it's describing.

MIgrating package flag -> feature in latest.

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

brandtkeller

Processing my understanding - appreciative of the huge scope this is appropriately engaging.

brandtkeller · 2025-07-18T20:17:17Z

+of release and deprecation, eg. "alpha", "beta", "GA", "deprecated". It introduces a ctx-based API, similar to previous
+usages of feature flags in Zarf, a centralized location where all CLI feature flags are **declared** so that users may
+easily find Flags and their documentation in code. Additionally, centralizing where Flags are declared allows us to
+automate site documentation that corresponds one to one with the docs in code. Inspiration for the Flag fields is taken


Agree with intent - curious if there is an overlay of alternative designs that uses our already-available utilization of cobra/viper/pflag . Are there constraints present here with the lifecycle exposed by those and our current use? maybe.

pflag has a lot in the way of lifecycle management and the integration with cobra/viper already answers questions for how we consume from config / generate documentation and manage the lifecycle of active/hidden/deprecated flags.

Just want to ensure we weigh that optionality before understanding if we require a bespoke implementation.

I believe the plag will only to handle the flag parsing part of things. This document focuses primarily on the managing the feature gates, enabled through flags. For example, given the following 4 features at various stages:

featureA - stable, enabled by default, can't be disabled.

featureB - beta, enabled by default, can be disabled.

featureC - beta, disabled by default, can be enabled.

featureD - alpha, disabled by default, can be enabled.

A zarf user, by default, has access to featureA and featureB (on by default). If a user wants to enable featureC, one needs to run zarf --feature-gate=featureC=true, if one wants both featureC and featureD - zarf --feature-gate=featureC=true,featureD=true. Alternatively if user wants to disable a feature that is on by default - zarf --feature-gate=featureB=false.

From a developer perspective you have 2 helpful utilities:

A registry of feature gates, ideally with information about its state (alpha, beta, stable), whether it's on or off by default, and lastly if it can be turned off.

A helper method which allows you to check if featureA is on or off, that you then use in your code base.

An example from k8s:

Feature metadata: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/pkg/features/kube_features.go#L405-L410 with a pointer to KEP, and history https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/pkg/features/kube_features.go#L1315-L1319

Sample usage: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/pkg/apis/batch/v1/defaults.go#L62

On the other hand, kubectl uses env vars to achieve this:

Feature metadata: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/staging/src/k8s.io/kubectl/pkg/cmd/util/helpers.go#L434-L438

Sample usage: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/staging/src/k8s.io/kubectl/pkg/cmd/cmd.go#L362-L364

Although, I'm planning to simplify the interface for kubectl a bit, since it's a bit clunky atm. The goal is to make it closer to what --feature-gates do.

Also, we could just re-use the k8s library for feature gates, rather than rolling yet another one https://github.com/kubernetes/component-base/blob/master/featuregate/feature_gate.go. Or use some other library that I'm pretty exist 😅

Seconding what @soltysh is saying here where pflag works great as a kind of CLI frontend for sure, but doesn't provide a flexible enough model or API for features. Especially when it comes to users needing to query or change feature state within code.

Regarding borrowing or reusing the K8s feature-gates - I definitely took a lot of inspiration from the implementation! Reading through the type was my starting point. K8s runtime mutable features which are compelling, as well as multiple features per gate. Maybe more than we need pre-1.0 though?

If are effective with the API and configuration layer I do think we can come back to the implementation and add functionality in or swap out for more if we need it in the future. The config layer (CLI flags, env vars, zarf-config.{yaml,toml}) already does this well imo. Where we'll need to be careful are the abstractions we offer to users for the SDK.

brandtkeller · 2025-07-18T20:27:36Z

+
+Feature flags are implemented on an opt-in basis. Upgrading to versions with feature flagging supported, and with
+toggle-able feature flags ought to be simple and transparent. There are no backwards compatibility concerns for previous
+versions of Zarf as feature flags deal with the code itself.


There are no backwards compatibility concerns for previous versions of Zarf as feature flags deal with the code itself.

Does this consider the impact of the package created and potential drift between the zarf version at creation and the version used during deploy?

Discussed in https://github.com/zarf-dev/proposals/pull/41/files#r2210582550

brandtkeller · 2025-07-18T20:30:38Z

+- Also if maintainers rely on enabling feature flags at specific versions then this keeps SDK to CLI parity.
+- Without global defaults we have to flip the boolean from enabled to disabled when providing a beta feature by default.
+- A major drawback with ctx is that it's difficult to inspect for users. "where did this flag come from?"
+- Maybe this is a good nudge to get features into GA sooner. More to discuss here.


++ It is a great reason for us to have a process that tracks current state and what we require for graduation?

brandtkeller · 2025-07-18T20:34:04Z

+-->
+
+- One critical pitfall to avoid with feature flags is creating _unexpected_ behavior, both for end users and maintainers.
+  Unclear state about which flags are enabled and not, and with and without defaults. We have some solutions below under


appreciate this - how can we make the propagation throughout the system more observable.

I think that w/ the addition of separate sets for default and user features that sourcing these gets even easier.

soltysh · 2025-07-28T14:12:57Z

+of release and deprecation, eg. "alpha", "beta", "GA", "deprecated". It introduces a ctx-based API, similar to previous
+usages of feature flags in Zarf, a centralized location where all CLI feature flags are **declared** so that users may
+easily find Flags and their documentation in code. Additionally, centralizing where Flags are declared allows us to
+automate site documentation that corresponds one to one with the docs in code. Inspiration for the Flag fields is taken


I believe the plag will only to handle the flag parsing part of things. This document focuses primarily on the managing the feature gates, enabled through flags. For example, given the following 4 features at various stages:

featureA - stable, enabled by default, can't be disabled.

featureB - beta, enabled by default, can be disabled.

featureC - beta, disabled by default, can be enabled.

featureD - alpha, disabled by default, can be enabled.

A zarf user, by default, has access to featureA and featureB (on by default). If a user wants to enable featureC, one needs to run zarf --feature-gate=featureC=true, if one wants both featureC and featureD - zarf --feature-gate=featureC=true,featureD=true. Alternatively if user wants to disable a feature that is on by default - zarf --feature-gate=featureB=false.

From a developer perspective you have 2 helpful utilities:

A registry of feature gates, ideally with information about its state (alpha, beta, stable), whether it's on or off by default, and lastly if it can be turned off.

A helper method which allows you to check if featureA is on or off, that you then use in your code base.

An example from k8s:

Feature metadata: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/pkg/features/kube_features.go#L405-L410 with a pointer to KEP, and history https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/pkg/features/kube_features.go#L1315-L1319

Sample usage: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/pkg/apis/batch/v1/defaults.go#L62

soltysh · 2025-07-28T14:20:40Z

+of release and deprecation, eg. "alpha", "beta", "GA", "deprecated". It introduces a ctx-based API, similar to previous
+usages of feature flags in Zarf, a centralized location where all CLI feature flags are **declared** so that users may
+easily find Flags and their documentation in code. Additionally, centralizing where Flags are declared allows us to
+automate site documentation that corresponds one to one with the docs in code. Inspiration for the Flag fields is taken


On the other hand, kubectl uses env vars to achieve this:

Feature metadata: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/staging/src/k8s.io/kubectl/pkg/cmd/util/helpers.go#L434-L438

Sample usage: https://github.com/kubernetes/kubernetes/blob/2a03dd1d5ecc66ca6a21fc9f526d4eedc9fef179/staging/src/k8s.io/kubectl/pkg/cmd/cmd.go#L362-L364

Although, I'm planning to simplify the interface for kubectl a bit, since it's a bit clunky atm. The goal is to make it closer to what --feature-gates do.

soltysh · 2025-07-28T14:23:29Z

+### Non-Goals
+
+- Elaborate configuration nightmares, where users don't know what is enabled or multiple flags are intertwined.
+- Forever features that are endless reworked and never make it to GA.


I'm inclined to move this to goals, to ensure we don't have forever features. Ideally you want to enforce some kind of timeline for every feature, but if you see something is stuck for too long you might want to consider to drop it, which is also easier depending on the stage the features is at. Alpha features can be dropped almost at a whim, whereas beta (especially those enabled by default) or stable require a proper deprecation period to allow users to stop using those.

Yeah I see what you mean, having features while models all of this explicitly does help with the problem. And the problem exists whether it's modeled or not.

soltysh · 2025-07-28T14:25:47Z

+- Rollout considerations, like can we _backport_ flags to previous features so that they can be associated with verions
+or disabled? 
+
+## FIXME Design Details


I believe this will depend on whether a particular functionality alters the package or its deployment or not. But storing that information in package metadata is definitely good idea.

soltysh · 2025-07-28T14:30:12Z

+  # Enabled is set if a feature is set.
+  Enabled bool
+  # Default is set if a feature is enabled by default, without being set by users.
+  Default bool


I'd suggest not to store Enabled next to Default, but rather keep this structure minimal such that it will only hold information about feature gate, such as: name, description, version, enabled, stage. Separately, you'll want to keep a registry of currently enabled flags (those with enabled=true out of the box and those enabled by users explicitly through flags), and devs should always use the registry helper to know if it's on or off.

This separation ensures we don't accidentally enable something that should not be enabled.

That's a good point, and the default vs. enabled states were intended to allow for this behavior in the same set of features. Thinking through this more though, there's another complication if a feature is disabled by default then enabled by users (default=true, enabled=false) -> (default=true, enabled=true) - that effectively overwrites the flag and there's no way for us to distinguish during runtime that it was changed.

I see the advantages of removing the "default" field and instead having two different sets of features. Having a simpler API starts to outweigh the downsides of some more impl. complexity. Namely, it allows us to store multiple copies of flags and see if the state has been changed by users. e.g. psuedocode:

default={Foo=false, bar=...} enabled={Foo=true}

This could be accomplished with another field on the struct, e.g. modified and constructors (e.g. feature.NewDefault, feature.NewEnabled, etc.) but I'd also lean towards keeping the struct minimal to keep the API simple and "just" use sets instead of boolean logic.

I'll update the API some so it can filter for default or not rather than passing the fields back and having the user do so.

soltysh · 2025-07-28T14:33:41Z

+- Make new features opt-in.
+- Have a clear and documented reference for users on which versions of Zarf introduce new features.
+- Runtime-level feature configuration
+- Stretch goal: multiple avenues of configuration (code, env vars, flags, config files)


Potentially, you could also have a better way to inform users about deprecated features, if they are using those.

Definitely a benefit, going to add it to goals.

soltysh · 2025-07-28T14:34:50Z

+// if len(flags) == 0 {...}
+```
+
+### TODO DISCUSS: Global API


I'm supportive.

soltysh · 2025-07-28T14:35:26Z

+
+Feature flags are implemented on an opt-in basis. Upgrading to versions with feature flagging supported, and with
+toggle-able feature flags ought to be simple and transparent. There are no backwards compatibility concerns for previous
+versions of Zarf as feature flags deal with the code itself.


Discussed in https://github.com/zarf-dev/proposals/pull/41/files#r2210582550

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

mkcp · 2025-07-31T16:57:50Z

Revision 2 of the proposal is done which instead proposes globally stored features and updates the API accordingly. I've provided some examples of how this is intended to work, but if there's any gaps please let me know!

I believe this is ready to merge given final draft review and alpha impl is ready to begin. There's still some TODOs left in that I'd like to come back to after implementing alpha. Nitpicks absolutely welcome!

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

…eads better Signed-off-by: Kit Patella <kit@defenseunicorns.com>

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

…mments to slashes Signed-off-by: Kit Patella <kit@defenseunicorns.com>

AustinAbro321

Nice work so far. Clarifying question on CLI UX

AustinAbro321 · 2025-08-01T13:56:53Z

+// My beautiful Zarf-enabled application
+```
+
+### CLI: Enabling and Disabling Flags


Curious on how this would look for something like --registry-proxy in #37. I think that'd be a good candidate for feature flag.

I would expect that if users are required to run zarf init --feature-enable=registry-proxy --registry-proxy, just running zarf init --feature-enable=registry-proxy wouldn't actually do anything in that case. Also if someone runs zarf init -h before registry-proxy is enabled should they see the --registry-proxy flag in help text?

The first form that you mentioned is also what I'd expect, that way the feature itself can be enabled but not actually run. Users could also enable the feature through ZARF_FEATURE_ENABLE="registry-proxy" zarf init --registry-proxy or through on-disk config so the main idea is keeping parity. Given that registry-proxy is an option, flagging it enables that option.

As for help text, it's a good callout. In a quick experiment, I wasn't able to get the boolean state of one rootCmd pflag to mark another as hidden (e.g. if IsColorDisabled { ...MarkHidden("--log-level") }. There may be another path though.

That said, the least clever path would be showing the flag text every time and documenting its release stage and which feature flag to use.

Yeah worth experimenting more on that in the future, but marking it as disabled in the flag text as a backup is reasonable for now.

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

AustinAbro321

A few todos still, but this is certainly in a state where we can start iterating on the implementation. Nice work.

mkcp added 2 commits June 30, 2025 11:57

init ZEP-0035

7bcac8e

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

iterate on ZEP-0035

964b289

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

mkcp self-assigned this Jul 15, 2025

continue iterating, clean up everything up to and including proposal

125f15a

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

mkcp mentioned this pull request Jun 25, 2025

Feature Flags #35

Open

6 tasks

iterate on 0035

f93a166

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

mkcp marked this pull request as draft July 15, 2025 18:25

AustinAbro321 reviewed Jul 16, 2025

View reviewed changes

mkcp commented Jul 17, 2025

View reviewed changes

clean up template in completed areas, add some notes for more design

91f681a

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

brandtkeller reviewed Jul 18, 2025

View reviewed changes

soltysh reviewed Jul 28, 2025

View reviewed changes

mkcp added 3 commits July 29, 2025 10:01

most of revision 2 done. some api work after discussion

cec56d2

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

Complete revision 2 API, abandoned ideas, and most prose

ca51584

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

fix some API goofs

190031a

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

mkcp marked this pull request as ready for review July 31, 2025 16:55

mkcp added this to Zarf Jul 31, 2025

mkcp moved this to PR Review in Zarf Jul 31, 2025

mkcp added 5 commits July 31, 2025 12:32

ensure defaults example has descriptions and an owner

e9505e2

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

feature.Store reads more like an implementation detail, feature.Set r…

7cd611c

…eads better Signed-off-by: Kit Patella <kit@defenseunicorns.com>

update risks and mitigations

cd38f5a

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

update some inconsistencies in summary, motivation, and goals sections

3813ed9

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

fix Enabled type IsEnabled function name collision. also fix pound co…

6aa327e

…mments to slashes Signed-off-by: Kit Patella <kit@defenseunicorns.com>

AustinAbro321 reviewed Aug 1, 2025

View reviewed changes

mkcp added 2 commits August 1, 2025 09:11

remove infra concerns section

07d24e3

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

update status to implementable and update target milestones

4d72b15

Signed-off-by: Kit Patella <kit@defenseunicorns.com>

AustinAbro321 approved these changes Aug 1, 2025

View reviewed changes

mkcp merged commit abd44b4 into main Aug 1, 2025
1 check passed

mkcp deleted the mkcp/00XX-feature-flags branch August 1, 2025 19:39

github-project-automation Bot moved this from PR Review to Done in Zarf Aug 1, 2025

Conversation

mkcp commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

brandtkeller left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkcp commented Jul 31, 2025

Uh oh!

AustinAbro321 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkcp Aug 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

mkcp commented Jul 15, 2025 •

edited

Loading

mkcp Aug 1, 2025 •

edited

Loading