Scientific Exploration Pipelines
Over the last decade, the software engineering and systems administration communities (also referred to as DevOps) have developed sophisticated techniques and strategies to ensure “software reproducibility”, i.e. the reproducibility of software artifacts and their behavior, using versioning, dependency management, containerization, orchestration, monitoring, testing and documentation. The key idea behind the Popper protocol is to manage every experiment in computation and data exploration as a software project, using tools and services that are readily available now and enjoy wide popularity. By doing so, scientific explorations become reproducible with the same convenience, efficiency, and scalability as software reproducibility, while fully leveraging continuing improvements to these tools and services. Rather than mandating a particular set of tools, the convention only expects components of an experiment to be scripted. There are two main goals for Popper:
- It should be usable in as many research projects as possible, regardless of their domain.
- It should abstract underlying technologies without requiring a strict set of tools, making it possible to apply it on multiple toolchains.
A common generic analysis/experimentation workflow involving a computational component is the one shown below. We refer to this as a pipeline in order to abstract from experiments, simulations, analysis and other types of scientific explorations. Although some projects don't fit this description, we focus on this model since it covers a large portion of existing pipelines. Typically, the implementation and documentation of a scientific exploration is done in an ad-hoc way (custom bash scripts, storing in local archives, etc.).
The idea behind Popper is simple: make an article self-contained by including in a code repository the manuscript along with every experiment's scripts, inputs, parametrization, results and validation. To this end we propose leveraging state-of-the-art technologies and applying a DevOps approach to the implementation of scientific pipelines (also referred to as SciOps).
Popper is a convention (or protocol) that maps the implementation of a pipeline to software engineering (and DevOps/SciOps) best-practices followed in open-source software projects. If a pipeline is implemented by following the Popper convention, we call it a popper-compliant pipeline or popper pipeline for short. A popper pipeline is implemented using DevOps tools (e.g., version-control systems, lightweight OS-level virtualization, automated multi-node orchestration, continuous integration and web-based data visualization), which makes it easier to re-execute and validate.
We say that an article (or a repository) is Popper-compliant if its scripts, dependencies, parameterization, results and validations are all in the same repository (i.e., the pipeline is self-contained). If resources are available, one should be able to easily re-execute a popper pipeline in its entirety. Additionally, the commit log becomes the lab notebook, which makes the history of changes available to readers, an invaluable tool to learn from others and "stand on the shoulders of giants". A "popperized" pipeline also makes it easier to advance the state of the art, since existing work can be extended by applying the same development model used in OSS (fork, make changes, publish new findings).
The general repository structure is simple: a paper folder and a pipelines
folder at the root of the project, with one subfolder per pipeline:
$> tree mypaper/
├── pipelines
│   ├── exp1
│   │   ├── README.md
│   │   ├── output
│   │   │   ├── exp1.csv
│   │   │   ├── post.sh
│   │   │   └── view.ipynb
│   │   ├── run.sh
│   │   ├── setup.sh
│   │   ├── teardown.sh
│   │   └── validate.sh
│   ├── analysis1
│   │   ├── README.md
│   │   └── ...
│   └── analysis2
│       ├── README.md
│       └── ...
└── paper
    ├── build.sh
    ├── figures/
    ├── paper.tex
    └── refs.bib
Pipeline Folder Structure
A minimal pipeline folder structure for an experiment or analysis is shown below:
$> tree -a paper-repo/pipelines/myexp
paper-repo/pipelines/myexp/
├── README.md
├── post-run.sh
├── run.sh
├── setup.sh
├── teardown.sh
└── validate.sh
Every pipeline has setup.sh, run.sh, post-run.sh, validate.sh and
teardown.sh scripts that serve as the entrypoints to each of the
stages of a pipeline. All these return non-zero exit codes if there's
a failure. In the case of
validate.sh, this script should print to
standard output one line per validation, denoting whether a validation
passed or not. In general, the form for validation results is
[true|false] <statement> (see examples below).
[true] algorithm A outperforms B
[false] network throughput is 2x the IO bandwidth
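A validate.sh stage that emits lines in this form could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names report and less_than are hypothetical (they are not part of Popper), and the measurements are hard-coded where a real pipeline would read them from the output of earlier stages.

```shell
#!/usr/bin/env bash
# Sketch of a validate.sh stage. The helper names (report, less_than) and the
# hard-coded measurements are illustrative assumptions, not part of Popper.

# report <statement> <command...>
# Runs the command and prints "[true] <statement>" if it succeeds,
# "[false] <statement>" otherwise.
report() {
  local statement="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "[true] ${statement}"
  else
    echo "[false] ${statement}"
  fi
}

# less_than <a> <b>: succeeds when a < b; awk is used so decimals work.
less_than() {
  awk -v a="$1" -v b="$2" 'BEGIN { exit !(a < b) }'
}

# In a real pipeline these values would be read from files written by
# earlier stages; they are fixed here to keep the sketch self-contained.
runtime_a=12.5
runtime_b=20.1
report "algorithm A outperforms B" less_than "${runtime_a}" "${runtime_b}"
```

Keeping each expectation behind a single report call means the script always prints exactly one line per validation, whether the underlying check passes or fails.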
The CLI tool includes a
pipeline init subcommand that can be executed to scaffold a pipeline
with the above structure. The syntax of this command is:
popper pipeline init <name>
where <name> is the name of the pipeline to initialize. More details
on how pipelines are executed are presented in the next section.
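To make the role of the stage scripts concrete, the following sketch runs them one after another and stops at the first non-zero exit code. It is not how the popper CLI is implemented: the run_stages function is hypothetical, and the stage order (setup, run, post-run, validate, teardown) and the choice to skip missing scripts are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stage driver; not the popper CLI. The stage order below is an
# assumption based on the script names in the pipeline folder.

# run_stages <pipeline-dir>
# Runs each stage script from inside the pipeline folder and returns non-zero
# as soon as one of them fails. Progress messages go to standard error.
run_stages() {
  local dir="$1" stage
  for stage in setup.sh run.sh post-run.sh validate.sh teardown.sh; do
    if [ ! -f "${dir}/${stage}" ]; then
      echo "skipping missing stage: ${stage}" >&2
      continue
    fi
    echo "running stage: ${stage}" >&2
    (cd "${dir}" && bash "./${stage}") || {
      echo "stage failed: ${stage}" >&2
      return 1
    }
  done
}

# Example invocation (the path is illustrative):
#   run_stages paper-repo/pipelines/myexp
```

Because every stage reports failure through its exit code, the driver needs no knowledge of what a stage does internally.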
Popper vs. Other Software
With the goal of putting Popper in context, the following is a list of comparisons with other existing tools.
Scientific Workflow Engines
Scientific workflow engines are "a specialized form of a workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or workflow, in a scientific application." Taverna and Pegasus are examples of widely used scientific workflow engines. For a comprehensive list, see here.
A Popper pipeline can be seen as the highest-level workflow of a
scientific exploration, the one which users or automation services
interact with (which can be visualized by doing
popper workflow). A
stage in a popper pipeline can itself trigger the execution of a
workflow on one of the aforementioned workflow engines. A way to
visualize this is shown in the following image:
The above corresponds to a pipeline whose
run.sh stage triggers the
execution of a workflow for a numeric weather prediction setup (the
code is available here).
Ideally, the workflow specification files (e.g. in
CWL format) would be stored in the
repository and be passed as a parameter to a bash script that is part of
a popper pipeline. For an example of a popper pipeline using the
Toil genomics workflow engine, see here.
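As a rough sketch of this idea, a run.sh stage could hand a workflow specification stored in the repository to a workflow runner. The example assumes a CWL runner such as cwltool, which takes a workflow description followed by an input job file; the run_workflow function, the WORKFLOW_RUNNER variable and the file names are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a run.sh stage that delegates to a workflow engine. The function
# name, the WORKFLOW_RUNNER variable and the file names are assumptions.

# run_workflow <spec-file> <job-file>
# Passes the workflow specification and its input job file to the runner
# (cwltool unless WORKFLOW_RUNNER says otherwise) and returns its exit code.
run_workflow() {
  local spec="$1" job="$2"
  local runner="${WORKFLOW_RUNNER:-cwltool}"
  if [ ! -f "${spec}" ]; then
    echo "workflow specification not found: ${spec}" >&2
    return 1
  fi
  "${runner}" "${spec}" "${job}"
}

# Example invocation with files kept next to run.sh (names are illustrative):
#   run_workflow workflow.cwl job.yml
```

Since the function returns the runner's exit code, a failed workflow makes the stage fail as well, which is what the pipeline convention expects.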
Virtualenv, Conda, Packrat, etc.
Language runtime-specific tools for Python, R, and others provide the
ability to recreate and isolate environments with all the
dependencies that are needed by an application written in one
of these languages. For example,
virtualenv can be used to create an
isolated environment with all the dependencies of a Python
application, including the version of the Python runtime itself. This
is a lightweight way of creating portable pipelines.
Popper pipelines automate and create an explicit record of the steps that need to be followed in order to create these isolated environments. For an example of a pipeline of this kind, see here.
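A setup.sh stage that records these steps might be sketched as follows. The environment folder name (venv), the requirements.txt file and the DRY_RUN switch are assumptions introduced for this example; the virtualenv and pip invocations use their standard command-line forms.

```shell
#!/usr/bin/env bash
# Sketch of a setup.sh stage that builds an isolated Python environment.
# The default names (venv, requirements.txt) and DRY_RUN are assumptions.

# run <command...>: executes the command, or only prints it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# setup_env [env-dir] [requirements-file]
# Creates the environment with an explicit interpreter, then installs the
# dependencies listed in the requirements file into it.
setup_env() {
  local env_dir="${1:-venv}" requirements="${2:-requirements.txt}"
  run virtualenv --python=python3 "${env_dir}" || return 1
  run "${env_dir}/bin/pip" install -r "${requirements}" || return 1
}

# Example invocation that only prints the commands it would run:
#   DRY_RUN=1 setup_env
```

Pinning exact versions in the requirements file is what makes the recorded steps repeatable; the script itself only documents and automates them.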
For pipelines that execute programs written in statically typed languages (e.g. C++), these types of tools are not a good fit and other "full system" virtualization solutions such as Docker or Vagrant might be a better alternative. For an example of such a pipeline, see here.
Continuous Integration
Continuous Integration (CI) is a development practice where developers integrate and deploy code frequently with the purpose of catching errors as early as possible. The pipelines associated with an article can benefit from using CI services. If the expectations on the output of a pipeline can be codified in the form of a unit test (a command returning a boolean value), they can be checked on every change to the pipeline scripts.
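One way to turn validation output into such a boolean command is sketched below. The check_validations function is hypothetical and assumes validations are reported one per line in the [true|false] <statement> form; treating an empty report as a failure is a design choice of this sketch, not documented Popper behavior.

```shell
#!/usr/bin/env bash
# Hypothetical helper that reduces validation lines to a single exit code,
# so that a CI service can treat them like a unit test.

# check_validations: reads validation lines on standard input and succeeds
# only if at least one line was read and every line starts with "[true] ".
check_validations() {
  local line total=0 failed=0
  while IFS= read -r line || [ -n "${line}" ]; do
    if [ -z "${line}" ]; then
      continue
    fi
    total=$((total + 1))
    case "${line}" in
      "[true] "*) ;;
      *)
        failed=$((failed + 1))
        echo "validation not met: ${line}" >&2
        ;;
    esac
  done
  [ "${total}" -gt 0 ] && [ "${failed}" -eq 0 ]
}

# Example invocation inside a CI job (the script name follows the pipeline
# folder structure):
#   bash validate.sh | check_validations
```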
Travis CI is an open-source, hosted, distributed continuous integration service used to build and test software projects hosted at GitHub. Alternatives to Travis CI are CircleCI and CodeShip. Self-hosted solutions such as Jenkins also exist. Each of these services requires users to specify and automate tests using its own configuration files (or domain-specific languages).
Popper can be seen as a service-agnostic way of automating the
execution of a pipeline on CI services with minimal effort. The
popper ci command generates configuration
files that a CI service reads in order
to execute a pipeline. Additionally, Popper can be used to test a
pipeline locally. Lastly, since
the concept of a pipeline and the validations associated with it are
first-class citizens in Popper, we can not only check that a pipeline
executes correctly (a failed execution is reported with a FAIL status) but we can also
verify that the output is the one expected by the original
implementers.
Reprozip / Sciunit
Reprozip "allows you to pack your
research along with all necessary data files, libraries, environment
variables and options", while Sciunits "are efficient,
lightweight, self-contained packages of computational experiments that
can be guaranteed to repeat or reproduce regardless of deployment
issues". They accomplish this by making use of
ptrace to track all
dependencies of an application. Popper can help in automating the tasks
required to install these tools; create and execute
Reprozip packages and Sciunits; and re-execute experiments in order
to verify that results are being reproduced.