The current state of the CLI flow is undesirable. Validations and defaults are invoked from multiple places and at different stages with some validations contain hidden dependencies. This creates an inconsistent user experience (error logging, etc.) and, more importantly, a significant maintenance issue. Given a big part of EKS-A's added value is its validations, this topic is of tremendous importance for the project.
There have been previous efforts on this area, however, they only focused on the type of validation and where to put the validation code, not about when or how to run them. Although this set the foundations for a better separation of validations between webhooks and controllers, it failed to solve the CLI inconsistency and coupling problems. Moreover, it didn't align well with the team's intent of keeping provider specific validations in their own package.
This document attempts to solve this issue by proposing a wholistic approach for defaulting and validating both CLI and API operations.
There are two types of validations:
- Context agnostic validations: there are no external dependencies, having the API object is enough to validate it: mutually exclusive fields, string format, empty fieldsβ¦
- These should live in the pkg/api/version package
- These should be the ones called in the validation webhook
- Context aware validations: these are dependent on other entities like clients or even other API objects
- These could leave in different places, Iβm not making a decision right now: providers, validations packageβ¦ This is probably worth a separate discussion.
- Try to keep these modular and reusable (eg. Donβt make them part of the providers structβs api)
- If they only need other API objects from the same package, they can live in the api package
- Controller: these shouldnβt be run in a webhook and must be part of the main reconciliation loop
Validations have grown organically over time. However, the natural evolution of requirements and usecases without taking the time to refactor previous implementations, has resulted in an incohesive array of frameworks, ideas and patterns. This doc doesn't intend to collect or represent them all in detail, but let this list be an example of the lack of cohesion and variety of usecases as well as a data to drive the solution:
pkg/validations
- It holds validation of many different domains (feature gates, certificates, docker daemon, local disk files, command line args, etc.), all implemented as standalone functions.
- It also offers some validation "utils/framework", like a validation runner, reporting, error types, etc.
- The validations are imported from validation compositions as well as directly from the commands in
cmd
.
pkg/validations/createvalidations
- This implements both cluster create validations (as standalone functions) as well as a "validation composition" that allows to build and run "preflight" validations for the create operation.
- This uses
validations.Validation
andvalidations.ValidationResult
but not thevalidations.Runner
. - The compiled validations (
createvalidations.CreateValidations
) is used in thecluster create
command, injected as a dependency to the corresponding workflow and called from theSetAndValidateTask
task (more on this later).
pkg/validations/upgradevalidations
- This implements both cluster upgrade validations (as standalone functions) as well as a "validation composition" that allows to build and run "preflight" validations for the upgrade operation.
- Strangely, this uses the
validations.ValidationResult
but not thevalidations.Runner
and works slightly different thanCreateValidations
. - The compiled validations (
upgradevalidations.UpgradeValidations
) is used in thecluster upgrade
command, injected as a dependency to the corresponding workflow and called from thesetupAndValidateTasks
task (more on this later).
pkg/validations/createcluster
- This is a validations composition. It attempts to group all validations for
create cluster
(the ones inpkg/createcluster
, provider validations, docker executable validations, kubeconfig path,cluster.Config
validations and Flux validations). - It uses
validations.Runner
to register, run the validations sequentially and report the validations result withvalidations.ValidationResult
. - It's a step in the good direction in terms of pattern and semantics vs other previous ideas but it might still be limited in scope.
- This is a validations composition. It attempts to group all validations for
pkg/api/v1alpha1
- Most eks-a API objects implement the
Validate
method, which is supposed to run only static data validations (or "context unaware"). Unfortunately most of these methods return at the first error as opposed to run them all and aggregate the errors. - Most eks-a API objects implement the
SetDefaults
method. - Most webhooks are also implemented in this package. They are composed of a combination of
object.Validate
calls and field immutability checks (forValidateUpgrade
). For defaults, they callSetDefaults
.
- Most eks-a API objects implement the
pkg/cluster
- This package exports
cluster.Config
and allows to set defaults and validate it withcluster.SetDefaults
andcluster.Validate
. These internally call aConfigManager
, which allows to register logic dynamically to parse from yaml, set defaults and validate acluster.Config
. - Validations in the package either call
.Validate
in the API objects or implement static validations but at theConfig
level (eg. validations that need multiple API objects). ConfigManager
's defaults also call theSetDefaults
method in API objects.
- This package exports
pkg/gitops/flux
- Exports validations through
Validations
method, which returns a slice ofvalidations.Validation
, hence designed to be used in conjunction withvalidations.Runner
, etc. These are run from the create cluster workflow.
- Exports validations through
pkg/providers
- All providers implement
SetupAndValidateCreateCluster
andSetupAndValidateUpgradeCluster
. They are called from the create and upgrade workflows. - These methods mix API defaults, validations and internal state changes (equivalent to an
Init
method). - The internal implementation of these method variates from provider to provider. Some of them have a
Validator
and aDefaulter
and Snow, for example, uses thecluster.ConfigManager
to organize defaults and validations.
- All providers implement
pkg/workflows
- Compose and run validations and defaults with
validations.Runner
from a workflow task.
- Compose and run validations and defaults with
test/framework
ClusterValidator
is yet another validator runner. This is one centered in validating the state of a cluster against a given cluster specification by using a k8s client to inspect the cluster API.
- Special note is due for commands in
cmd
, since these are the entities with more knowledge and they are a good example of the fragmentation addressed in this doc. Currently, they handle different types of validations and at different stages:- Independently run validations, like:
FileExists
for the cluster config yaml.- Parsing and Cluster config API validation (
GetAndValidateClusterConfig
) - Tinkerbell CLI flags.
- Registry mirror auth.
- Validations to
cluster.Config
withcluster.Validate
. Defaults are applied by just callingcluster.NewSpecFromClusterConfig
.
- "Grouped" validations in
commonValidations
- Docker daemon config
- File exists, parsing and Cluster config API validation (
GetAndValidateClusterConfig
). This is duplicated. - Kubeconfig file and format validation.
- Delegated validations to the workflows through
interfaces.Validator
.
- Independently run validations, like:
- Define a standard for types of validation and defaults.
- Define where validations and defaults should be defined within the code base.
- Define a standard for when and where to run validations and defaults for both CLI commands and API requests.
- Define what validations and defaults should and should not do (scope).
- Propose a mechanism to run defaults from a single place.
- This needs to allow for defaults defined in different packages.
- Propose a mechanism to run validations from a single place.
- This needs to allow for validations defined in different packages.
- This should allow to run all validations and collect all errors, as opposed to return after the first error.
- Propose an execution plan to move all existing validations to such model.
Validations and defaults can be classified in:
- Static data: they only require pure API objects and have no other dependencies.
- Context aware: they need to read information from the context (env vars, files, external APIs, etc.). Generally the depend on clients/adapters and a
context.Context
.
Although this doc focuses on API validations, it's worth mentioning CLI input validation. This entails command arguments, flags and env vars. In general, these should be fast and have no dependencies. However, it's possible config files (provided through a path) might need to be parsed in order to run conditional validations (for example, depending on the provider specified in the cluster config, extra flags or env vars might be needed).
Static data validations/defaults that only require one single API object should live in the Validate
and SetDefaults
api struct methods, thus in the pkg/api/v1**
package. (Note: I'm more than open to relax this constraint if there is a good reason to, but given this is one of the only things that seem to resemble some kind consistency with regards to this whole topic, it seems worth to maintain it).
Immutability validations have been treated differently (for some unknown reason) in the CLI than in webhooks. These checks should be implemented in the api
package and should not be duplicated in two places. If the same checks are needed in both webhook and CLI, implement as standalone functions and reuse in both places. In the CLI, just read the object with a client and pass both old and new like the webhook does.
The rest of validations and defaults (static data for a group of API structs and the context aware ones) can live anywhere, as long the dependency directions make sense, with a few constraints:
- Context aware validation cannot live in
pkg/api/v1**
. This is to keep this package dependencies at a minimum. - Avoid packages like
validations
. Don't group validations for what they are, group them by what they validate (their context). For example, for docker validations, don't usevalidations.ValidateDockerDesktopVersion
, think aboutdocker.ValidateDesktopVersion
inpkg/docker
. - Avoid the
cmd
package. These tend to end up being untested and lack structure.
Webhooks that only require static data validations/defaults should live in the pkg/api/v1**
(where all present webhooks live today). If they have any other dependency the cannot live in the api
package. (Note: we could possibly define a webhooks
folder elsewhere if we think that could avoid problems in the future, leaving this up for discussion)
- Defaults should not modify data provided by the user, they can only add information. Any transformation needed should be moved to runtime operations and never get reflected back on the API objects.
- They should have not secondary effects on anything else except the provided API object/s, including the internal state of the entity running the validation.
- API objects should only be modified by the defaulting logic. This guarantees they are invariants during the CLI execution, allowing for workflow reentry/re-execution (think command fails and on the next execution we continue from where it stopped).
- They must not have secondary effects on such as modifying the object under validation.
- API struct
Validate
methods should return afield.ErrorList
. This facilitates using it from webhooks and ensures the API user facing error is as complete as possible.
- CLI
- At the beginning of the command flow, in most cases before any other business logic.
- Always before validations.
- Defaults will be grouped into a single usecase per CLI flow: one for create cluster, another one for upgrade, etc. The individual defaulting logic can be reused by importing it and composing it into multiple usecases. These will live in
pkg/cli
, and be named after its flow:CreateClusterDefaulter
,UpgradeClusterDefaulter
, etc. - This doc doesn't prescribe which entity should trigger the defaults (command in
cmd
package, workflow, etc.). But they do need to be run before the API objects get injected in any other entity as a dependency (think service objects). Given the approach the provider's refactor project has taken to inject thecluster.Spec
and any other non dynamic data (this is, known at the beginning of the command execution) as state in the service objects,cmd
seems like the obvious place to do this since this is where service objects are constructed.
- API
- Webhook: static data defaults (only depend on the given object) and sometimes, defaults that only require kubernetes API calls (to the same cluster where the object lives). Remember that only the given object should be altered from the webhook.
- Controllers: any other defaults requiring extra information, calls to external systems or simply too slow for a webhook.
- CLI
- At the beginning of the command flow, in most cases before any other business logic.
- Always after defaults.
- Validations will be grouped into a single usecase per CLI flow: one for create cluster, another one for upgrade, etc. The individual validation logic can be reused by importing it and composing it into multiple usecases. These will live in
pkg/cli
, and be named after its flow:CreateClusterValidator
,UpgradeClusterValidator
, etc. - This doc does not prescribe the entity invoking validations (command in
cmd
package, workflow, etc.). The decision is left for the provider refactor team to make later.
- API
- Webhook: static data validations (only depend on the given object) and sometimes, validations that only require kubernetes API calls (to the same cluster where the object lives).
- When using a client to read from the API server, be very careful. Webhooks are supposed to be very fast, remember they are synchronous.
- Only run efficient queries (avoid full scans) and always think about the number of results you expect to get (depending on the type, the difference can be of orders of magnitude).
- If your design performance is not very obvious, benchmark it.
- When using a client to read from the API server, be very careful. Webhooks are supposed to be very fast, remember they are synchronous.
- Controllers: any other validations requiring extra information, calls to external systems or simply too slow for a webhook.
- Webhook: static data validations (only depend on the given object) and sometimes, validations that only require kubernetes API calls (to the same cluster where the object lives).
This section describes the validation framework and how we will be using it. Requirements:
- Allow to compose and run validations implemented in different places.
- Make implementing validations as easy as possible.
- Design mainly for the CLI use-case. Optimize defaults for this scenario and possibly allow custom configuration for others.
This approach is similar to cluster.ConfigManager
, and some other approaches, in that it takes a two step process where validations are registered before they are run.
// Runner allows to compose and run validations.
type Runner struct
runner := validation.NewRunner()
runner.Register(myValidation)
runner.Register(otherValidations...)
err := runner.RunAll(ctx, spec)
Requirements:
- They might need to contact external systems. In that case they should make use of a
context.Context
. - They might need access to the full specification of an EKS-A cluster. In that case, they should use
cluster.Spec
. - Validations should be able to, at least, return an error if the input is not valid.
This boils down to take a context and a cluster spec as input and return some type of error as output. We could implement it with functions or an interface, but given that we want to make registering validations as easy as possible and keep the framework API surface as simple as possible, we opt out for a function:
type Validation func(ctx context.Context, spec *cluster.Spec) error
As seen above, the validation Runner
allows to register these Validation
funtions independtly. And as a separate step run them all, returning an aggregation of all errors.
We want the framework to allow, not only to register validations implemented in different places, but also validations that are composed of multiple validations. This, although it introduces some complexity, it offers a lot of flexibility to registrars to aggregate validations in different ways and it facilitates reusing existing validations (that tend to be already aggregated in one big method).
For validations to be registered as one handler but still be considered as different in the runner result, they should return an errors.Aggregate
. The framework will understand this and will flatten the result. This allows for multiple levels of nesting.
Most preflight validations we currently run in the CLI can be run concurrently as they are not dependent on each other. For validations that make external calls this can be slow, concurrently executing validations may provide a speed improvement.
One could argue that, given validations are composable, concurrency can be left for registrars/validation implementers. However, in the past this approach has led to concurrency never been implemented.
We will take advantage of validations being composable and will make the default model concurrent. All handlers registered through Register
will run concurrently.
If there is a subset of validations that need to be sequential, we will compose them into one handler that iterates and aggregates the results via a Sequentially
helper.
// Sequentially composes a set of validations into one which will run them sequentially and in order.
func Sequentially(validations ...Validation) Validation
It can be tempting to allow to compose concurrent validations inside a sequential collection. Although appealing, this could significantly increase the complexity of the runner. Since we can't find a place where this is currently needed, the framework will only run concurrently the first level validations. If that ever becomes a requirement, we will default to wrap another runner inside a handler. Given all returned aggregated errors are flatten, this should work without changes to the framework.
Some of the current attempts at a validation runner/composer log the success of validations. This requires providing a name for the validation. In addition, some of the provider validations log both success and failure of validations (using β and β).
Trying to handle this at the framework level has several drawbacks:
- It requires naming validations, which increases the framework surface area, increases the entry level to register a validation and composability more difficult.
- It limits logging granularity when a validation function is composed of multiple sequential validations.
- It couples CLI output with validation functions expanding their scope of concern.
For these reasons, the framework won't provide any logging functionality.
* For debug logging (we can generalize this to anything above verbosity 0
), validations should use their own logger. This should be injected at construction time as any other dependency. This still gives control to the program to decide which logs are presented to the user and when, by passing different loggers.
* For information more tailored to feedback for the end user (like progress in slow validations), it will be left up to the framework callers to implement this. Based how logging is currently used to communicate this progress, the information is usually presented by "grouping" validations and not at the individual validation level. So the usecases, where validations are grouped and given semantic value, will be responsible to define these different groups and print logs in between them as necessary. This can be easily accomplished by composing multiple runners. Given a traditional logger might not be the best way to implement a presentation layer of this kind, this approach also allows to replace the logger with something else (in the future) without having to refactor any of the validation logic.
The existing validations.ValidationResult
allows handlers to return a Remediation
message. Given we want to just use the error
interface to make it easy to register validations and that Remediation
is empty for most ValidationResult
s , we chose to make this optional. Validations wanting to provide a possible remediation for a failure, can return an error
that also implements the Remediable
interface.
type Remediable type {
Remediation() string
}
Existing validation runners, like framework.ClusterValidator
, incorporate retries into the runner itself. That might seem sensible, given a lot of validations might want to rely on retries to bypass transient errors. However, not only this would complicate the API to register validations but the retry logic will always be limited by the API itself (which is only addressable by making the API more complex).
In this case we would leave this out of the framework and let each validation handler decide how to deal with retries.
Given the framework logic is very much independent from the object being validated (cluster.Spec
), we propose to use generics so the same framework can be used for different types. The complexity added is minimal and even in the case where the framework grows in a direction where generics don't make sense, moving back to a concrete type would be trivial. There are already places in the codebase where we envision make use of this:
// Validation is the logic for a validation of a type O.
type Validation[O any] func(ctx context.Context, obj O) error
In order to be able to prevent validations from modifying obj
and detecting when if that happens, we add a constraint to O
that allows us to DeepCopy
the object so we can compare after running all validations:
type Validatable[O] interface {
DeepCopy() O
}
// Validation is the logic for a validation of a type O.
type Validation[O Validatable[O]] func(ctx context.Context, obj O) error
This framework is the glue that will give form to the validation usecases. Example:
type CreateClusterValidator struct {
runner *validation.Runner
}
type Validation func(ctx context.Context, spec *cluster.Spec) (error)
type ProviderValidator interface {
CreateClusterValidations() []Validation
}
func NewCreateClusterValidator(provider ProviderValidatonRegistrar) CreateClusterValidator {
r := validation.NewRunner[*cluster.Spec]()
r.Register(
docker.ValidateMemory,
cluster.ValidateConfig,
)
r.Register(provider.CreateClusterValidations()...)
r.Register(gitops.ValidateRepositoryAccess)
}
func (v CreateClusterValidator) Run(ctx context.Context, spec *cluster.Spec) error {
return v.runner.RunAll(ctx, spec)
}
The requirements for defaults are quite similar to the ones for validations:
- Take a
context.Context
- Based on
cluster.Spec
- Be able to return an error if, for example, an external call fails.
In addition, and following good practices for functions that are supposed to update a given object:
- Don't rely on updating a given pointer value but take a
cluster.Spec
as input and return acluster.Spec
as output, with the necessary updates.
type Default func(ctx context.Context, spec cluster.Spec) (cluster.Spec, error)
Defaults should only be set in the API objects configurable by the user. For example, the Bundles
(which compiles all eks-a dependencies) shouldn't be modified. However, all of these are part of the cluster.Spec
, giving defaulting logic the ability to modify them. We could enforce this by separating them and providing a cluster.Config
(all user inputable API objects) and a Bundles
(or something else that includes the rest of the cluster specification).
Following the same logic as for validations, we propose the use of generics:
// Default is the logic for a default for a type O. It should return a value of O
// wether it updates it or not. When there is an error, return the zero value of O
// and the error.
type Defaulter[O any] func(ctx context.Context, obj O) (O, error)
All defaults will be sequential given that they need to update the cluster.Spec
and changes need to be propagated. However, defaults shouldn't rely on order. If such dependency exists between two or more defaults, then make them one single default.
Default handlers can choose to implement concurrency but they are responsible for handling data races.
This framework is the glue that will give form to the default usecases. Example:
type CreateClusterDefaulter struct {
runner *defaulting.Runner
}
type Default func(ctx context.Context, spec *cluster.Spec) (*cluster.Spec, error)
type ProviderDefaulter interface {
CreateClusterDefaults() []Default
}
func NewCreateClusterDefaulter(provider ProviderDefaulter) CreateClusterDefaulter {
r := defaulting.NewRunner[*cluster.Spec]()
r.Register(
cluster.DefaultConfig,
)
r.Register(provider.CreateClusterDefaults()...)
r.Register(gitops.DefaultRepositoryName)
}
func (v CreateClusterDefaulter) Run(ctx context.Context, spec *cluster.Spec) (*cluster.Spec, error) {
return v.runner.RunAll(ctx, spec)
}
- The validation runner framework will go in
pkg/validation
. - The defaults runner framework will go in
pkg/defaulting
. - The composed validations and defaults for
create cluster
command will go inpkg/cli
, implemented inCreateClusterValidator
andCreateClusterDefaulter
. - The composed validations and defaults for
upgrade cluster
command will go inpkg/cli
implemented inUpgradeClusterValidator
andUpgradeClusterDefaulter
. - All validations in
pkg/validations
,pkg/validations/createvalidations
andpkg/validations/upgradevalidations
should be moved to their appropriate packages. pkg/validations/createcluster
is replaced bycli.CreateClusterValidator
.- Validations and defaults in
pkg/cluster
might stay there but won't be called bycluster.Validate
andcluster.SetDefaults
, which will be removed. - Validation in
test/framework
should be probably be moved into proper packages. This will be low priority unless such validations need to be reused from the CLI controller.
We propose the addition of our own errors
package (in pkg/errors
) that will implement the Aggregate
error functionality. Although we will mirror the functionality from apimachinery
so we don't reinvent the wheel, this will give us more flexibility to adapt to changes in Go error aggregation and handling (possibly go 1.20).
In addition, we plan on using this package to mirror the Wrap
functionality from github.com/pkg/errors
. This will allow us to only import one errors
package from our code.
- Get this proposal approved
- Move validations and defaults out of workflows
- Split providers defaults and validations into two methods for both upgrade and delete.
- Update providers to return a slice of validations and defaults. The first iteration will probably be a wrapper around existing ones.
- Compose validations and defaults in
pkg/validation/*cluster
andpkg/default/*cluster
- Invoke the new aggregated defaults and validations that use the framework.
- Move validations from
pkg/validations/**
to the proper packages. - Cleanup: delete unused framework/runners,
cluster.ConfigManager
, etc. This will probably need multiple rounds. - Split providers validations and defaults into different Handlers that can be run independently.
- Update validations for
{apiobject}.Validate
to not return on first error and return afield.ErrorList
.