Pipeline in a Docker container
This repository demonstrates how to build an entire data analysis pipeline that runs inside a Docker container.
Docker is a virtualization technology that can be used to bundle an application and all its dependencies in a virtual container that can be distributed and deployed to run reproducibly on any Windows, Linux, or macOS system. Docker is commonly used in computational biology to bundle a tool and all its dependencies such that it can essentially be used as an executable (see the StaPH-B Docker and slimbioinfo projects for examples). For instance, installing something as simple as `bwa` on a modern ARM MacBook is challenging because of how C code compiles on Apple silicon, but with Docker, `bwa` can essentially be replaced with `docker run bwa ...`.
However, while containers are typically used to deploy microservices, containerization can also be used to deploy an entire data analysis pipeline. I illustrate this in pipebox using a fairly trivial data analysis pipeline involving read mapping and variant calling.
Inputs:
- Paired-end reads
- A reference genome
- An output base filename
What the pipeline does:
- If the reference genome isn't indexed, it'll index it with both `samtools faidx` and `bwa index`.
- Very simple QC on the sequencing reads using `seqtk`.
- Use `bwa mem` to map reads to the indexed reference genome.
- Very simple QC on the alignments with `samtools stats`.
- Use `bcftools mpileup` and `bcftools call` to call variants.
- Very simple QC on the variant calls with `bcftools stats`.
- Creates a simple single-page PDF "report" with a few plots from the QC steps.
All of these data processing steps happen inside the container. You provide a volume mount so that the local filesystem is accessible at a location inside the running container, and output files are written back to a mounted location (e.g., the present working directory). This pipeline will run on any system where Docker is available.
This repo assumes a minimal working understanding of containerization, including the difference between an image and a container, and how they're each created. If you're familiar with all these concepts, you can skip to the Demo section below.
A Docker image is like a blueprint or a snapshot of what a container will look like when it runs. Images are created from code using a Dockerfile, which is a set of instructions used to create the image. This is important, because with Dockerfiles we can define infrastructure in code -- code which can be easily version controlled and worked on collaboratively in a platform like GitHub. When we have our instructions for building an image written out in a Dockerfile, we can create an image using `docker build`.
Once we have an image, we can "rehydrate" that image and bring to life a running container using `docker run`. When a container is initialized, it spins up, does whatever it needs to do, and is destroyed. A container could serve a long-running process, such as a webservice or database. Here, we're spinning up a container that runs a script inside that container. When the script runs to completion, the container is destroyed. From a single image we can use `docker run` to spin up one running container or one thousand running containers, all starting from the identical image.
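For instance, a minimal build-then-run cycle might look like the following (the image name `myimage` is just a placeholder for illustration):

```sh
# Build an image named "myimage" from the Dockerfile in the current directory
docker build --tag myimage .

# Start a container from that image, run its default command, and remove the
# container when it exits (--rm); you can do this as many times as you like
docker run --rm myimage
```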
Another important concept is a volume mount. A running container is a virtualized operating system with its own software and, importantly, its own filesystem inside the container. If we have a plain Alpine Linux container and run something like `docker run alpine ls`, this will start the Alpine container and run `ls` in the default working directory, which is `/`, inside the container. That is, you'll see the usual system folders like `/bin`, `/dev`, `/etc`, and others. If you were to run `docker run alpine ls /home` you'd see nothing, because the container has no users other than root!
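You can try this yourself with the stock `alpine` image (the exact listing will vary with the Alpine version, but the point is that you're seeing the container's filesystem, not your host's):

```sh
# List the container's root filesystem: /bin, /dev, /etc, and friends
docker run --rm alpine ls

# /home inside a fresh Alpine container is empty
docker run --rm alpine ls /home
```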
This concept is important because if you're running tools inside the container, you must provide a way for files on the host system (i.e., your computer or VM) to be accessible to the running container. To do this we use the `--volume` flag, or more commonly its short form `-v`, which allows us to mount local directories into the container. For instance, `docker run -v /home/turner/mydata:/data` would mount `/home/turner/mydata` from my laptop to `/data` inside the running container. If I have a script that expects to see data in `/data` inside the container, this would work well.
Here's a common trick that allows you to mount the contents of the present working directory on your machine into the running container using shell substitution with `$(pwd)`. Using `-v $(pwd):$(pwd)` mounts the present working directory on the host system to a path of the same name inside the container, and `-w $(pwd)` sets the working directory inside the container to the same directory specified by the volume mount.
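As a quick illustration using the stock `alpine` image, the following lists the contents of your current host directory from inside a container:

```sh
# Mount the present working directory into the container at the same path,
# make it the working directory, and list its contents from inside the container
docker run --rm -v $(pwd):$(pwd) -w $(pwd) alpine ls
```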
Further suggested reading:
- StaPH-B docker user guide: https://staphb.org/docker-builds/. This is an excellent guide to using Docker in bioinformatics written by a group of bioinformaticians working in public health labs. Start with the chapters in the upper-right. Specifically useful are the running containers chapter which provides additional detail on volume mounts, and developing containers chapter which discusses creating Dockerfiles, building images, and initializing containers.
- "Ten simple rules for writing Dockerfiles for reproducible data science." PLoS computational biology 16.11 (2020): e1008316. DOI:10.1371/journal.pcbi.1008316. This paper provides a great overview on containerization with a specific focus on the importance of Dockerfiles for reproducible data science and bioinformatics workflows.
- "pracpac: Practical R Packaging with Docker." arXiv:2303.07876 (2023). DOI:10.48550/arXiv.2303.07876. Shameless self-promo: I wrote this paper and the software it describes for automating the building and deployment of pipelines-as-Docker-containers with specific attention to pipelines which require a custom-built R package as part of the pipeline. The software is implemented as an R package and it's R-centric (Shiny, MLOps with tidymodels, etc.), but isn't limited solely to R and R-related tools, as demonstrated in the package vignettes.
First, get this repository and build the pipebox image.
```sh
git clone git@github.com:colossal-compsci/pipebox.git
cd pipebox
docker build --tag pipebox .
```
Alternatively, create a personal access token with read permissions for the GitHub Container Registry, log in (once), then pull the image directly from the registry.
```sh
## Run once
# export GHCR_PAT="YOUR_TOKEN_HERE"
# echo $GHCR_PAT | docker login ghcr.io -u USERNAME --password-stdin

# Pull the image
docker pull ghcr.io/colossal-compsci/pipebox

# Give it a short name so you can run it more easily
docker tag ghcr.io/colossal-compsci/pipebox pipebox
```
Running the container with no arguments prints a minimal help message (we use `--rm` to destroy the container when it exits). The pipebox container runs the pipebox.sh script as its `ENTRYPOINT`, which runs the entire pipeline.
```sh
docker run --rm pipebox
```

```
Usage: pipebox.sh <read1> <read2> <reffa> <outbase>
```
Let's run it on the test data in this repository. The `-v $(pwd):$(pwd)` flag mounts the present working directory on the host system to a path of the same name inside the container, while `-w $(pwd)` sets the working directory inside the container to the same directory specified by the volume mount. These two flags make it easy to make data on the host accessible to the running container.
```sh
cd testdata
docker run --rm -v $(pwd):$(pwd) -w $(pwd) pipebox \
  SRR507778_1.fastq.gz \
  SRR507778_2.fastq.gz \
  yeastref.fa.gz \
  results/SRR507778
```
You should see all the files in the `testdata/results` folder:
- `SRR507778.seqtkfqchk.tsv`: Results from running `seqtk fqchk` after interleaving the paired FASTQ files.
- `SRR507778.bam`: Sorted alignment.
- `SRR507778.samtoolsstats.tsv`: Results from running `samtools stats` on this alignment.
- `SRR507778.vcf.gz`: Results from variant calling against the reference genome.
- `SRR507778.bcfstats.tsv`: Results from running `bcftools stats` on these variant calls.
- `SRR507778.metrics.pdf`: "Report" compiling a few outputs from each of the tools above (PDF format).
- `SRR507778.metrics.png`: "Report" compiling a few outputs from each of the tools above (PNG format).
Let's start with a look at the Dockerfile. Rather than starting this image from a vanilla Debian or Ubuntu image, we actually start from an image that already has mamba (a faster/better conda) installed. If you're interested, you can see how this image is built by looking at its Dockerfile in the conda-forge/miniforge-images GitHub repo.
```dockerfile
FROM condaforge/mambaforge
```
Next, we copy the environment.yml file in this repo, which lives on the host system, into the container at `/`. Take a look at this environment.yml file: it defines all the dependencies we want to install via conda, and specifies the versions of some tools. Note that the name of the conda environment is `base` -- that's because rather than creating a new conda environment with a different name, we'll just install all of this in the base environment, which is the environment that's active when this container spins itself up. The `RUN` instruction executes a command during the image build; here, it updates the base environment using the YAML file copied into `/` by the `COPY` statement above it.
```dockerfile
COPY ./environment.yml /environment.yml
RUN mamba env update --file /environment.yml
```
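If you want to confirm that the conda-managed tools made it into the base environment of the built image, you can override the entrypoint (described further below) and ask for their versions. This is just an illustrative check, not part of the repo, and it assumes samtools and bcftools are among the tools listed in environment.yml (the pipeline requires both):

```sh
# Illustrative check: bypass the image's ENTRYPOINT and print tool versions
docker run --rm --entrypoint /bin/bash pipebox \
  -c "samtools --version | head -1; bcftools --version | head -1"
```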
In the next section, we're going to install seqtk from source. Seqtk is available via conda, but I want to demonstrate how to build something from source that might not be available via conda. First I need to install some basic utilities: the GNU C compiler, Make, and the zlib development libraries. I'm also installing vim in case I need to step into the running container and edit something while debugging. Next, I set a build argument (`ARG`) specifying the version of seqtk I want to use, then download the source code, compile it, and install it inside the container image.
```dockerfile
RUN apt update && apt install -y vim gcc make zlib1g-dev

ARG VERSION_SEQTK="1.4"
RUN wget -q https://github.com/lh3/seqtk/archive/refs/tags/v${VERSION_SEQTK}.tar.gz && \
    tar xzf v${VERSION_SEQTK}.tar.gz && \
    cd seqtk-${VERSION_SEQTK} && \
    make && \
    make install
```
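Similarly, you could sanity-check the source build by overriding the entrypoint and running `seqtk` directly; with no arguments it simply prints its usage message (again, an illustrative check, not part of the repo):

```sh
# seqtk with no arguments prints its usage message (and exits non-zero)
docker run --rm --entrypoint seqtk pipebox
```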
Next, I copy the `src` directory in this repo, which lives on the host system, into the container image at `/`. We'll get to what's in the `src` directory in the next step. This means that whenever the container is instantiated, everything in `src` will be available inside the running container at `/src/*`.
```dockerfile
COPY ./src /src
```
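To convince yourself the copy worked, you could list `/src` inside a container built from this image (illustrative only):

```sh
# Show the pipeline code baked into the image at /src
docker run --rm --entrypoint ls pipebox /src
```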
Finally, I declare an `ENTRYPOINT`. This is the command that's run whenever the Docker container is instantiated. It's the script we'll go over next.
```dockerfile
ENTRYPOINT ["/bin/bash", "/src/pipebox.sh"]
```
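Because the `ENTRYPOINT` is in exec form, anything passed to `docker run` after the image name is appended as arguments to that command:

```sh
# Arguments after the image name are appended to the ENTRYPOINT. Running the
# image with no arguments therefore executes `/bin/bash /src/pipebox.sh` inside
# the container, which just prints the usage message:
docker run --rm pipebox
```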
When the container is initialized, the pipebox.sh pipeline script is run from `/src/pipebox.sh` inside the container. This script expects four command-line arguments, as described above. Command-line arguments must point to locations as seen from inside the running container, hence the `-v $(pwd):$(pwd) -w $(pwd)` trick described above for operating on files living in the present working directory on the host system.
You can view the pipebox.sh script itself to see what's going on. It does everything described above in the Motivation section, via a shell script operating on files inside the container, which are usually volume-mounted from the host. The pipeline script could just as easily be a Python script, a shell script that launches a Nextflow run (if Nextflow is installed and configured inside the container), or any arbitrary code executed inside the running container on files volume-mounted from the host. It could also call any number of secondary scripts for further processing; in this example, it calls an R script to do post-processing on the results.
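To make that concrete, here's a minimal sketch of what a pipeline script in this style might look like. To be clear, this is not the repo's pipebox.sh -- read the actual script for the real details -- just an illustration of the pattern: positional arguments in, a chain of tools, and a hand-off to a post-processing script at the end. The exact tool options and the arguments passed to pipebox-post.R are assumed here for illustration.

```sh
#!/bin/bash
# Minimal sketch of a pipeline-in-a-container script (NOT the repo's pipebox.sh);
# option choices, file naming, and argument handling are illustrative only.
set -euo pipefail

R1=$1; R2=$2; REF=$3; OUTBASE=$4
mkdir -p "$(dirname "$OUTBASE")"

# Index the reference genome if it isn't indexed already
[ -f "${REF}.fai" ] || samtools faidx "$REF"
[ -f "${REF}.bwt" ] || bwa index "$REF"

# Very simple read QC: interleave the pairs and summarize with seqtk
seqtk mergepe "$R1" "$R2" | seqtk fqchk - > "${OUTBASE}.seqtkfqchk.tsv"

# Map reads, sort the alignment, and run alignment QC
bwa mem "$REF" "$R1" "$R2" | samtools sort -o "${OUTBASE}.bam"
samtools index "${OUTBASE}.bam"
samtools stats "${OUTBASE}.bam" > "${OUTBASE}.samtoolsstats.tsv"

# Call variants and run variant QC
bcftools mpileup -f "$REF" "${OUTBASE}.bam" | bcftools call -mv -Oz -o "${OUTBASE}.vcf.gz"
bcftools stats "${OUTBASE}.vcf.gz" > "${OUTBASE}.bcfstats.tsv"

# Hand off to an R script for post-processing and plotting
Rscript /src/pipebox-post.R "$OUTBASE"
```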
The last step the pipebox.sh script performs is calling another script, pipebox-post.R, an R script that post-processes the outputs from `seqtk`, `samtools stats`, and `bcftools stats` produced by the main script. You can read the code in pipebox-post.R to get a sense of what it's doing -- essentially reading in and parsing the tabular output from these tools and stitching a few plots together. Outputs are shown above in the Results section.
The pipebox image can be built with both the `latest` and `x.y.z` tags using the build.sh script in this repo, optionally passing the `--no-cache` flag. The version is scraped from a commented version string in this README. The image is also tagged with `latest` and `x.y.z` for the GitHub Container Registry namespace. To build and deploy:
```sh
./build.sh --no-cache
docker push --all-tags <registry>/<namespace>/pipebox
```