- Validate pipeline structure without having to run the pipeline
- Define pipelines as code
pypyr currently reads pipelines from yaml files. pypyr is very forgiving when interpreting pipeline yaml - it parses step-by-step as it goes (or just-in-time, if you will) and does NOT validate the entire yaml file's structure in advance.
This does make pypyr very flexible, in that it's not restrictive about the over-all structure of the file and is relatively fault-tolerant - even if one section of a pipeline is not valid, other valid sections will still work.
However, this also means that pipeline authors only discover at runtime whether there are structural errors in their pipeline yaml. pypyr currently has no way of validating that a pipeline's structure is correct and runnable without actually running the pipeline. This is annoying in long-running pipelines, because pypyr could burn a lot of time before it reaches a later malformed step in the pipeline and raises a formatting error.
Additionally, API consumers want to construct pipelines in code, rather than have to author in yaml. The original design objective with pypyr is that the pipeline yaml is intuitive and meant for humans to author and read. This remains the case and pypyr is in NO WAY getting rid of yaml pipelines - the design principle of simple, intuitively human-readable yaml pipelines still obtains.
However, pypyr can provide an alternative for API coders to construct pipelines in code. Currently API consumers have to construct pipelines in yaml; typed pipeline definitions would make it possible to create pipelines right there in the code, without having to switch context to a yaml file.
The plan is to augment the intuitive yaml pipeline authoring experience with a similarly intuitive authoring experience in code.
We can achieve both Wishlist items with the least effort if we model the entirety of a pipeline as classes in code.
Expand the current DSL classes to encompass over-all pipeline structure.
Practically speaking, this means creating a `PipelineBody` class to compose the already existing `pypyr.dsl.Step` classes:
```
PipelineDefinition          <-- exists already
    PipelineBody            <-- NEW
        - Sequence of Step  <-- exists already
```
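Sketched very roughly in code, the composition could look like this. These dataclasses are stand-ins to show the shape only - apart from `PipelineDefinition` and `pypyr.dsl.Step`, which exist already, none of these names or signatures are a committed API:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """Stand-in for the already existing pypyr.dsl.Step."""
    name: str
    in_args: dict = field(default_factory=dict)


@dataclass
class PipelineBody:
    """NEW: named step-groups, each composing a sequence of Step."""
    step_groups: dict[str, list[Step]]
```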
The `PipelineBody` initialization is effectively the validation. If a mapping can parse into a `PipelineBody`, it means it's a valid pipeline. This will happen on loading the pipeline, so pipeline authors will know immediately if there are problems with the pipeline, rather than have to wait until pypyr reaches a step at runtime to know whether it passes validation.
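To make the fail-fast behavior concrete, here is a hedged sketch of parsing-as-validation, building on the stand-in classes above (the function name and error types are assumptions, not the real parse entrypoint):

```python
def pipeline_body_from_mapping(mapping: dict) -> PipelineBody:
    """Parse raw loaded yaml into a PipelineBody, failing fast.

    Illustration only: the real parse would build pypyr.dsl.Step
    instances. The point is that structural errors surface at load
    time, not mid-run.
    """
    if not isinstance(mapping, dict):
        raise TypeError('pipeline must be a mapping at the top level')

    step_groups = {}
    for group_name, steps in mapping.items():
        if not isinstance(steps, list):
            raise TypeError(
                f"step-group '{group_name}' must be a sequence of steps")
        parsed = []
        for step in steps:
            if isinstance(step, str):
                parsed.append(Step(name=step))
            elif isinstance(step, dict) and 'name' in step:
                parsed.append(Step(name=step['name'],
                                   in_args=step.get('in', {})))
            else:
                raise TypeError(
                    f"malformed step in step-group '{group_name}': {step!r}")
        step_groups[group_name] = parsed
    return PipelineBody(step_groups=step_groups)
```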
This does NOT validate each step's `in` inputs - that remains just-in-time. This might seem like an oversight, but here are the reasons:
- Step `in` is dynamic. This is one of the strengths of pypyr. The step's `in` args could be derived dynamically from other preceding steps in the pipeline - we cannot know at design time what those values are or if they resolve as expected. The values only resolve at run-time.
- The Step signature is simple - it's just `def run_step(context)`. The idea is to allow pipeline authors to create steps with a single function without having to inherit from classes or do more intricate interface-driven development. The key objective is to be as low-code as possible. Whatever validation happens in a Step happens in that same function, so we don't have an easy way to separate a Step's validation from a Step's running without breaking existing steps. A proposal here could be to introduce an optional `def validate(context)` function that would allow a step to validate as a separate phase, BUT this would be of limited use since it couldn't possibly validate dynamic inputs, which is key to pypyr's power.
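For illustration only, such an opt-in hook could look like this in a custom step module. The `run_step` signature is the existing pypyr interface; the `validate` hook is purely the hypothetical proposal discussed above, not an existing or committed pypyr API:

```python
# mystep.py - a custom pypyr step module.

def validate(context):
    """Hypothetical opt-in validation hook: static checks only.

    Could assert that required keys exist, but cannot check values
    that preceding steps only produce at run-time.
    """
    context.assert_key_has_value(key='path', caller=__name__)


def run_step(context):
    """The existing, simple pypyr step signature."""
    print(f"processing {context['path']}")
```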
Given that the new `PipelineBody` will contain a parsed, known-correct pipeline, modify the pipeline cache and the run logic to consume the `PipelineBody`, rather than consuming the unstructured `dict` representing the raw pipeline as it was parsed.
Add API entrypoints for coders to run the `PipelineDefinition`, or even the contained `PipelineBody` directly:
```python
def run_pipeline_body(
    pipeline_name: str,
    pipeline_body: PipelineBody,
    args_in: list[str] | None = None,
    parse_args: bool | None = None,
    dict_in: dict | None = None,
    groups: list[str] | None = None,
    success_group: str | None = None,
    failure_group: str | None = None,
    loader: str | None = None,
    py_dir: str | bytes | PathLike | None = None
) -> Context


def run_pipeline_definition(
    pipeline_name: str,
    pipeline_definition: PipelineDefinition,
    args_in: list[str] | None = None,
    parse_args: bool | None = None,
    dict_in: dict | None = None,
    groups: list[str] | None = None,
    success_group: str | None = None,
    failure_group: str | None = None,
    loader: str | None = None,
    py_dir: str | bytes | PathLike | None = None
) -> Context
```
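Putting it together, an API consumer could then define and run a pipeline entirely in code, with no yaml file on disk. A hedged sketch, re-using the stand-in `pipeline_body_from_mapping` helper from above; the import location for `run_pipeline_body` is an assumption:

```python
from pypyr.pipelinerunner import run_pipeline_body  # assumed module home

# Build the pipeline body in code from a plain mapping - no yaml file.
body = pipeline_body_from_mapping({
    'steps': [
        {'name': 'pypyr.steps.echo', 'in': {'echoMe': 'hello pipeline'}},
        'pypyr.steps.debug',
    ],
})

# Run it exactly like a file-based pipeline, getting the Context back.
context = run_pipeline_body(pipeline_name='in-memory-pipeline',
                            pipeline_body=body)
```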
Introduce a validate switch on the cli:

```
$ pypyr validate my-dir/my-pipeline
# or maybe
$ pypyr my-dir/my-pipeline --validate
```
Introduce a new reserved key-word `_meta`. Ad hoc yaml that is NOT structural pipeline step-groups/steps will go under this new `_meta` key. The ultimate plan is to use `_meta` to provide self-documenting pipelines, with `author` and `help` or `description` sections for usage instructions and the like.
This is necessary because in the new dispensation pypyr will attempt to parse the entire yaml, so all of the yaml must conform to some sort of schema - if we allowed arbitrary yaml in a pipeline again, it would defeat the wish-list purpose of validation.
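For example, a self-documenting pipeline could then look something like this (a sketch: `_meta` is the reserved key proposed here; the `author` and `help` fields and their values are illustrative placeholders drawn from the future plans mentioned above):

```yaml
_meta:
  author: Jane Doe
  help: |
    Usage: pypyr my-dir/my-pipeline env=dev

steps:
  - name: pypyr.steps.echo
    in:
      echoMe: structural step-groups keep working exactly as before
```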
- No more ad hoc yaml in a pipeline - all yaml in a pipeline must conform to the schema.
- Custom loaders returning a `PipelineDefinition` will have to wrap a mapping (aka `dict`) into a `PipelineBody` - see the sketch after this list. Note that loaders returning a `Mapping` (e.g a `dict`) will keep on working as before, no changes necessary.
- Broken pipelines that used to run until they got to a structurally malformed step will now not run at all, rather than run until they hit the breaking point.
- Existing pipelines with a `_meta` step-group will not work anymore.
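For custom loaders, the wrap could look something like this hedged sketch. The `get_pipeline_definition` loader signature and `PipelineDefinition`/`PipelineInfo` exist already; `pipeline_body_from_mapping` is the hypothetical stand-in from earlier, standing in for whatever the real parse entrypoint ends up being:

```python
import yaml

from pypyr.pipedef import PipelineDefinition, PipelineInfo


def get_pipeline_definition(pipeline_name, parent):
    """Custom loader: the upgrade is the one-line wrap at the end."""
    with open(f'{pipeline_name}.yaml', encoding='utf-8') as file:
        payload = yaml.safe_load(file)

    info = PipelineInfo(pipeline_name=pipeline_name,
                        loader=__name__,
                        parent=parent)

    # before: PipelineDefinition(pipeline=payload, info=info)
    # after: wrap the raw mapping into a PipelineBody first.
    return PipelineDefinition(pipeline=pipeline_body_from_mapping(payload),
                              info=info)
```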
pypyr does not lightly introduce breaking changes. They are annoying for anyone downstream. We have to measure carefully whether the benefits are worth the breakages. The goal is to minimize the scope and impact of breakages.
The breaking changes here are the minimum necessary, and they are unavoidable if pypyr is to provide this long-asked-for feature. Most users will not even notice a difference, other than a more efficient development cycle with quicker validation feedback.
- No more ad hoc yaml in pipelines - there might well be pipelines out in the wild that use custom yaml sections for documentation, meta-data and yaml anchors. Remedying these to work in the new dispensation is just a copy/paste of the non-pipeline yaml to go under the new `_meta` reserved key.
- Custom pipeline loaders are relatively rare out in the wild - most pypyr users stick to the built-in `file` loader. And loaders returning dicts/maps aren't affected at all, only ones using the relatively more recently introduced `PipelineDefinition`. The code upgrade required here is minor - it's a one-liner (as sketched above), which will be marked clearly in the release notes.
- Pipelines with broken steps are definitionally broken. Although we are breaking backwards compatibility here in the strict sense, in that it could well result in failures of pipelines that at least partly worked before upgrading, this will be a major version upgrade and the change in behavior will be explicit in the release notes. Even so, these pipelines would have failed and reported the error anyway; it's just that there might have been side-effects from preceding steps before the failure condition - the operator will be aware that the pipeline is quitting and reporting failure, and that it's not "working", well before this "breaking change" arrives. The exception is if a pipeline contained steps (e.g in a failure handler) that never ran before, in which case the operator might be unaware that the pipeline contains broken steps. Even in this case, the breaking change would highlight this problem and provide opportunity to remedy in advance, rather than have it become a problem only on the first iteration of such broken steps.
- Hopefully `_meta` is an unusual enough term for a custom group that it won't clash with existing group-names. If it does, the fix is easy enough - just rename to anything other than `_meta`; even a one-character change to `meta` will do. To minimize impact, we will introduce an explicit error message to make it clear to the operator why it's not working if there is an attempt to run a `_meta` group.
Keep pypyr as it is, but create an entirely separate validator.
Pros:
- no breaking changes to the pypyr core
Cons:
- maintain pipeline structural logic in two different places
- can't create pipelines as code
Introduce an entirely new layer of model classes to model pipeline structure.
Map these models to existing dsl classes.
Pros:
- no breaking changes
Cons:
- have to maintain pipeline structure in separate places, and also maintain the mapping/translation between these.
Create a YAML schema that code editors can use for syntax highlighting.
This does not actually build validation into the pypyr core, nor does it allow for pipelines as code.
This is more of an optional extra we can do at a later date, regardless of the current architectural decision.
Implement proposal. The breaking changes are limited enough in scope compared to the tangible benefits for the pypyr-verse.