Pipeline variables from params file #3633

@dmpetrov

With the introduction of the new multi-stage pipeline, we need a way to define variables in the pipeline. For example, the intermediate file name cleansed.csv is used by two stages in the following pipeline, so it should be defined once as a variable:

stages:
    process:
        cmd: "./process.bin --input data --output cleansed.csv"
        deps:
             - path: data/
        outs:
             - path: cleansed.csv

    train:
        cmd: "python train.py"
        deps:
             - path: cleansed.csv
             - path: train.py
             - path: params.yaml
               params:
                   lr: 0.042
                   layers: 8
                   classes: 4
        outs:
             - path: model.pkl
             - path: log.csv
               cache: true
             - path: summary.json

We need to solve two problems here:

  1. Define a variable in one place and reuse it from multiple places/stages.
  2. Let users read file names from config files (as in the train stage) rather than from the command line (as in the process stage), which many prefer.

We can solve both problems with a single abstraction, a params file variable:

stages:
    process:
        cmd: ./process.bin
        outs:
             - path: "params.yaml:cleansed_file_name"
        ....
    train:
        cmd: "python train.py"
        deps:
             - path: "params.yaml:cleansed_file_name"

This feature is useful in the current DVC design as well. It is convenient to read file names from the params file and still define the dependencies properly, for example: dvc run -d params.yaml:input_file -o params.yaml:model.pkl
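
To make the proposed params.yaml:key references concrete, here is a minimal Python sketch of how such a reference might be resolved to a real path. It is not DVC's implementation: the function name resolve_path and the params_files mapping are hypothetical, and the params file is represented as an already-parsed dict so the example needs no YAML library.

```python
from typing import Any, Mapping


def resolve_path(ref: str, params_files: Mapping[str, Mapping[str, Any]]) -> str:
    """Resolve a 'file:key' reference to the value stored under that key.

    Hypothetical helper: anything that does not point at a known params
    file (for example a plain path such as 'train.py') is returned unchanged.
    """
    file_name, sep, key = ref.partition(":")
    if not sep or file_name not in params_files:
        return ref
    try:
        value = params_files[file_name][key]
    except KeyError:
        raise KeyError(f"{key!r} is not defined in {file_name}") from None
    return str(value)


# Assumed contents of params.yaml after parsing (illustrative values only).
params_files = {
    "params.yaml": {"cleansed_file_name": "cleansed.csv", "lr": 0.042},
}

# Paths as they would appear in the two stages of the proposal.
process_outs = ["params.yaml:cleansed_file_name"]
train_deps = ["params.yaml:cleansed_file_name", "train.py"]

resolved_outs = [resolve_path(p, params_files) for p in process_outs]
resolved_deps = [resolve_path(p, params_files) for p in train_deps]
print(resolved_outs, resolved_deps)
```

Both stages resolve to the same file name, so renaming the intermediate file only requires changing one value in the params file. The sketch treats only the text before the first colon as a file name, and only when that file is a known params file, so other paths that happen to contain a colon pass through untouched.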

Labels: A: templating, feature request