SciPipe


Project links: Documentation & Main Website | Issue Tracker | Mailing List

Project updates

Introduction

SciPipe is a library for writing Scientific Workflows, sometimes also called "pipelines", in the Go programming language.

When you need to run many commandline programs that depend on each other in complex ways, SciPipe helps by making the process of running these programs flexible, robust and reproducible. SciPipe also lets you restart an interrupted run without over-writing already produced output and produces an audit report of what was run, among many other things.

SciPipe is built on the proven principles of Flow-Based Programming (FBP) to achieve maximum flexibility, productivity and agility when designing workflows. Compared to plain dataflow, FBP has the benefit that processes are fully self-contained, so that a library of re-usable components can be created and plugged into new workflows ad hoc.

Similar to other FBP systems, SciPipe workflows can be likened to a network of assembly lines in a factory, where items (files) are flowing through a network of conveyor belts, stopping at different independently running stations (processes) for processing, as depicted in the picture above.

SciPipe was initially created for problems in bioinformatics and cheminformatics, but works equally well for any problem involving pipelines of commandline applications.

Project status: SciPipe is still alpha software, and minor breaking API changes still happen as we try to streamline the process of writing workflows. Please follow the commit history closely for any API updates if you have code already written in SciPipe (let us know if you need any help migrating code to the latest API).

Benefits

Some key benefits of SciPipe that are not always found in similar systems:

  • Intuitive behaviour: SciPipe operates by flowing data (files) through a network of channels and processes, not unlike the conveyor belts and stations in a factory.
  • Flexible: Processes that wrap command-line programs or scripts can be combined with processes coded directly in Golang.
  • Custom file naming: SciPipe gives you full control over how files are named, making it easy to find your way among the output files of your workflow.
  • Portable: Workflows can be distributed either as Go code to be run with go run, or as stand-alone executable files that run on almost any UNIX-like operating system.
  • Easy to debug: As everything in SciPipe is just Go code, you can use some of the available debugging tools, or just println() statements, to debug your workflow.
  • Supports streaming: Can stream outputs via UNIX FIFO files, to avoid temporary storage.
  • Efficient and Parallel: Workflows are statically compiled Go code that runs fast. SciPipe also leverages pipeline parallelism between processes as well as task parallelism when there are multiple inputs to a process, making efficient use of multiple CPU cores.

Known limitations

Hello World example

Let's look at an example workflow to get a feel for what writing workflows in SciPipe looks like:

package main

import (
    // Import SciPipe, aliased to sp
    sp "github.com/scipipe/scipipe"
)

func main() {
    // Init workflow and max concurrent tasks
    wf := sp.NewWorkflow("hello_world", 4)

    // Initialize processes, and file extensions
    hello := wf.NewProc("hello", "echo 'Hello ' > {o:out|.txt}")
    world := wf.NewProc("world", "echo $(cat {i:in}) World > {o:out|.txt}")

    // Define data flow
    world.In("in").From(hello.Out("out"))

    // Run workflow
    wf.Run()
}

Running the example

Let's put the code in a file named scipipe_helloworld.go and run it:

$ go run scipipe_helloworld.go
AUDIT   2018/07/17 21:42:26 | workflow:hello_world             | Starting workflow (Writing log to log/scipipe-20180717-214226-hello_world.log)
AUDIT   2018/07/17 21:42:26 | hello                            | Executing: echo 'Hello ' > hello.out.txt
AUDIT   2018/07/17 21:42:26 | hello                            | Finished: echo 'Hello ' > hello.out.txt
AUDIT   2018/07/17 21:42:26 | world                            | Executing: echo $(cat ../hello.out.txt) World > hello.out.txt.world.out.txt
AUDIT   2018/07/17 21:42:26 | world                            | Finished: echo $(cat ../hello.out.txt) World > hello.out.txt.world.out.txt
AUDIT   2018/07/17 21:42:26 | workflow:hello_world             | Finished workflow (Log written to log/scipipe-20180717-214226-hello_world.log)

Let's check which files SciPipe has generated:

$ ls -1 hello*
hello.out.txt
hello.out.txt.audit.json
hello.out.txt.world.out.txt
hello.out.txt.world.out.txt.audit.json

As you can see, it has created the files hello.out.txt and hello.out.txt.world.out.txt, and an accompanying .audit.json file for each of them.

Now, let's check the output of the final resulting file:

$ cat hello.out.txt.world.out.txt
Hello World

Now we can rejoice that it contains the text "Hello World", exactly as a proper Hello World example should :)

Now, those filenames were a little long and cumbersome, weren't they? SciPipe gives you fine-grained control over how to name your files if you don't want to rely on the automatic file naming. For example, we could set the first filename to a static one, and then use that name as the basis for the file name of the second process, like so:

package main

import (
    // Import the SciPipe package, aliased to 'sp'
    sp "github.com/scipipe/scipipe"
)

func main() {
    // Init workflow with a name, and max concurrent tasks
    wf := sp.NewWorkflow("hello_world", 4)

    // Initialize processes and set output file paths
    hello := wf.NewProc("hello", "echo 'Hello ' > {o:out}")
    hello.SetOut("out", "hello.txt")

    world := wf.NewProc("world", "echo $(cat {i:in}) World >> {o:out}")
    // The modifier 's/.txt//' will replace '.txt' in the input path with ''
    world.SetOut("out", "{i:in|s/.txt//}_world.txt")

    // Connect network
    world.In("in").From(hello.Out("out"))

    // Run workflow
    wf.Run()
}

Now, if we run this, the file names get a little cleaner:

$ ls -1 hello*
hello.txt
hello.txt.audit.json
hello.txt.world.go
hello.txt.world.txt
hello.txt.world.txt.audit.json

The audit logs

Finally, let's have a look at one of the audit files that were created:

$ cat hello.txt.world.txt.audit.json
{
    "ID": "99i5vxhtd41pmaewc8pr",
    "ProcessName": "world",
    "Command": "echo $(cat hello.txt) World \u003e\u003e hello.txt.world.txt.tmp/hello.txt.world.txt",
    "Params": {},
    "Tags": {},
    "StartTime": "2018-06-15T19:10:37.955602979+02:00",
    "FinishTime": "2018-06-15T19:10:37.959410102+02:00",
    "ExecTimeNS": 3000000,
    "Upstream": {
        "hello.txt": {
            "ID": "w4oeiii9h5j7sckq7aqq",
            "ProcessName": "hello",
            "Command": "echo 'Hello ' \u003e hello.txt.tmp/hello.txt",
            "Params": {},
            "Tags": {},
            "StartTime": "2018-06-15T19:10:37.950032676+02:00",
            "FinishTime": "2018-06-15T19:10:37.95468214+02:00",
            "ExecTimeNS": 4000000,
            "Upstream": {}
        }
    }
}

Each such audit file contains a hierarchical JSON representation of the full workflow path that was executed in order to produce this file. On the first level is the command that directly produced the corresponding file. Then, under "Upstream" and indexed by their filenames, there is a similar chunk for each input file, describing how it was generated. This structure repeats recursively for large workflows, so that for each file generated by the workflow there is always a full, hierarchical history of all the commands run, with their associated metadata, to produce that file.

You can find many more examples in the examples folder in the GitHub repo.

For more information about how to write workflows using SciPipe, and much more, see the SciPipe website (scipipe.org)!

More material on SciPipe

Acknowledgements

Related tools

Find below a few tools that are more or less similar to SciPipe and worth checking out before deciding which tool fits you best (in approximate order of similarity to SciPipe):