# SoS Docker Guide

## General introduction

### What is docker and why it is helpful

This is a big question to answer, but in essence you can think of docker containers as virtual machines with applications but without the bulky OS part, or as applications bundled with stripped-down OSes. Docker containers are much more lightweight than virtual machines because all docker containers share the same core OS (the kernel of the host), and related containers (e.g. different applications derived from the same CentOS or Ubuntu OS) share the same base image. Please refer to the [docker website](https://www.docker.com/) for details about docker. I have found it helpful to watch a few YouTube videos on docker.

Docker is very helpful in building (bioinformatics) workflows because:

1. Applications are encapsulated in docker containers so that they do not interfere with the underlying OS or with other applications. For example, we can run a workflow with applications that are based on different versions of Python 2 and Python 3 without having to install them locally and call the correct version of Python, because each application uses the specific version of Python and the required libraries and tools inside its own container.

2. Workflows are more stable and reproducible. Unlike, for example, a local installation of Python, which can be affected by other software and by upgrades of Python, a docker image does not change once it is built.

3. The same docker containers can be executed on different OSes (e.g. various versions of Linux, macOS, etc.), so a workflow built on a macOS workstation can be executed in a cluster environment.

There is of course some complexity in the use of docker, but SoS makes it easy to use docker in your workflows.

### Installing and configuring docker

Docker is relatively new and is evolving very fast, so it is important to install the latest version from the [docker website](https://www.docker.com/). The website provides detailed step-by-step instructions, and you should have no problem installing docker on your machine.

After installation, you should be able to start a docker terminal and run the command

```bash
$ docker run hello-world
```

as suggested by the documentation. Depending on the version of docker (e.g. docker under Windows), docker might run inside a virtual machine. It is very important to understand that **the configuration (e.g. RAM, CPU) of the docker machine is different from that of the host machine**, so your docker machine might be restricted to, for example, 1 CPU and 1 GB of RAM, which is insufficient for any serious work. You will most likely need to reconfigure your docker virtual machine (e.g. from the VirtualBox app, locate a machine named `default`).

## Running a script inside docker

Running a docker-based workflow is easy because SoS automatically downloads docker images and executes scripts inside docker containers. Before you start any workflow that uses docker, it is a good idea to check that your docker daemon is running with

In [1]:
!docker ps

CONTAINER ID        IMAGE               COMMAND               CREATED             STATUS              PORTS                   NAMES
3a8fec17646e        eg_sshd             "/usr/sbin/sshd -D"   3 days ago          Up 3 days           0.0.0.0:32768->22/tcp   test_sos
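
The same check can be scripted. The sketch below is a minimal Python helper that runs `docker ps` and reports whether the command succeeded. The function name `docker_daemon_running` is made up for this illustration and is not part of SoS; the helper returns `False` when the `docker` executable is not on the `PATH`.

```python
import shutil
import subprocess


def docker_daemon_running(timeout=10):
    """Return True if `docker ps` succeeds, i.e. the daemon is reachable.

    Hypothetical helper for illustration; not part of the SoS API.
    """
    # No docker client installed: there is nothing to talk to.
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "ps"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `docker ps` exits with a non-zero code when it cannot reach the daemon.
    return result.returncode == 0


if __name__ == "__main__":
    print("docker daemon running:", docker_daemon_running())
```

A workflow driver could call such a helper once at startup and stop early with a clear message instead of failing in the first docker-based step.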


### How SoS works with docker

If you do not have ruby installed locally and would like to run a ruby script, you can execute it inside a `ruby` container.

In [3]:
%run
ruby: docker_image='ruby'
    line1 = "Cats are smarter than dogs";
    line2 = "Dogs also like meat";

    if ( line1 =~ /Cats(.*)/ )
      puts "Line1 contains Cats"
    end
    if ( line2 =~ /Cats(.*)/ )
      puts "Line2 contains  Dogs"
    end

Line1 contains Cats


If you run the script with option `-v3`, you would see a line showing the actual `docker run` command executed by SoS. The command would look similar to

```
docker run --rm   -v /Users:/Users -v /tmp:/tmp 
    -v /tmp/path/to/docker_run_30258.rb:/var/lib/sos/docker_run_30258.rb
    -t -P 
    -w=/Users/bpeng1/sos/sos-docs/src/tutorials
    -u 12345:54321    ruby
    ruby /var/lib/sos/docker_run_30258.rb
```

Basically, SoS downloads a docker image called `ruby` and runs the command `docker run` to execute the specified script, with the following options:

* `--rm` Automatically remove the container when it exits
* `-v /Users:/Users` `-v /tmp:/tmp` maps local directories `/Users` and `/tmp` to the docker image so that these directories can be accessed from within the docker image.
* `-v /tmp/path/to/docker_run_30258.rb:/var/lib/sos/docker_run_30258.rb` maps a temporary script (`/Users/bpeng1/sos/sos-docs/src/tutorials/tmp2zviq3qh/docker_run_30258.rb`) to the docker image.
* `-t` Allocate a pseudo-tty
* `-P` Publish all exposed ports to the host interfaces
* `-w=/Users/bpeng1/sos/sos-docs/src/tutorials` Set working directory to current working directory
* `-u 12345:54321` Use the host user ID and group ID inside docker so that files created by docker (on shared volumes) are accessible from outside of docker.
* `ruby` name of the docker image
* `ruby` name of the interpreter for the script
* `/var/lib/sos/docker_run_30258.rb` the script inside of docker

The details of these options can be found in the [docker run manual](https://docs.docker.com/engine/reference/run/). They are chosen by default to work in the majority of scenarios but can fail for some docker images, in which case you will need to use SoS action parameters to customize the way the images are executed. These parameters include general [action parameters](https://vatlab.github.io/sos-docs/doc/documentation/Targets_and_Actions.html#Action-options-12) and [parameters that are specific to `docker_image`](https://vatlab.github.io/sos-docs/doc/documentation/Targets_and_Actions.html#docker_image).
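
To make the structure of this command line concrete, the following Python sketch assembles an equivalent `docker run` invocation from the options listed above. It is an illustration only: `build_docker_command` is a hypothetical helper, not the function SoS uses internally, and it merely reproduces the option order of the example command.

```python
import shlex


def build_docker_command(image, interpreter, host_script, workdir,
                         volumes=("/Users:/Users", "/tmp:/tmp"),
                         user=None):
    """Assemble a `docker run` command similar to the one shown above.

    Hypothetical helper for illustration; not part of the SoS API.
    """
    # The script is mounted under /var/lib/sos inside the container.
    script_name = host_script.rsplit("/", 1)[-1]
    container_script = "/var/lib/sos/" + script_name

    cmd = ["docker", "run", "--rm"]
    for volume in volumes:
        cmd += ["-v", volume]
    cmd += ["-v", f"{host_script}:{container_script}"]
    cmd += ["-t", "-P", f"-w={workdir}"]
    if user is not None:
        cmd += ["-u", user]  # "uid:gid" of the host user
    cmd.append(image)
    if interpreter:
        cmd.append(interpreter)
    cmd.append(container_script)
    return cmd


cmd = build_docker_command(
    image="ruby",
    interpreter="ruby",
    host_script="/tmp/path/to/docker_run_30258.rb",
    workdir="/Users/bpeng1/sos/sos-docs/src/tutorials",
    user="12345:54321",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each option separate, so paths containing spaces would not need manual quoting if the list were passed to `subprocess.run`.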

### Building docker-image on-the-fly

Building a docker image is usually done outside of SoS if you are maintaining a collection of docker images to be shared by your workflows, your groups, or everyone. However, if you need to create a docker image on-the-fly or would like to embed the Dockerfile inside a SoS script, you can use the `docker_build` action to build a docker image.

For example, you can build a simple image

In [9]:
docker_build: tag='test_docker'
  FROM ubuntu:14.04


0

and use the image

In [10]:
sh: docker_image='test_docker'
  ls /usr

bin  games  include  lib  local  sbin  share  src


This tutorial will build a series of simple docker images to demonstrate the use of various options.

### Customized or image-default working directory (`workdir` and `docker_workdir`)

SoS by default sets the current working directory of the docker image to the working directory of the host system, essentially adding `-w $(pwd)` to the command line. For example, with the following docker image, the `pwd` of the script is the current working directory on the host machine.

In [3]:
sh: docker_image='ubuntu:14.04'
  echo `pwd`

/Users/bpeng1/sos/sos-docs/src/tutorials


Since the action option `workdir` changes the working directory of the script, you can use this option to change the working directory inside the docker image as well. For example,

In [4]:
sh: docker_image='ubuntu:14.04', workdir='..'
  echo `pwd`

/Users/bpeng1/sos/sos-docs/src


This default behavior is convenient when you use commands in a docker machine to process input files on your host machine but it has a few caveats:

1. The current working directory might not be accessible to the docker image.
2. The docker image might define its own `WORKDIR` that the command needs in order to work.
3. You might want to specify another working directory inside of docker.

The first problem is solved by SoS's attempt to map the current working directory to the docker image, which will be discussed in the next section. The second and third problems can be addressed by another option, `docker_workdir`.

Option `docker_workdir`, if specified, overrides `workdir` and allows the use of the image's default or a customized working directory inside the docker image. When `docker_workdir` is set to `None`, no `-w` option will be passed to the docker image and the default `WORKDIR` will be used. Otherwise an absolute path inside the docker image can be specified.

For example, the following customized docker image has its `WORKDIR` set to `/usr`. Its working directory is set to the host working directory by default, to `/usr` with `docker_workdir=None`, and to `/home/random_user` with `docker_workdir='/home/random_user'`.

In [8]:
docker_build: tag='test_docker_workdir'
  FROM ubuntu:14.04
  WORKDIR /usr

sh: docker_image='test_docker_workdir'
  echo `pwd`
  
sh: docker_image='test_docker_workdir', docker_workdir=None
  echo `pwd`
  
sh: docker_image='test_docker_workdir', docker_workdir='/home/random_user'
  echo `pwd`

/Users/bpeng1/sos/sos-docs/src/tutorials
/usr
/home/random_user


Note that the directory is relative to the docker file system, so it does not have to exist on the host system. Docker also creates the `docker_workdir` if it does not exist, so you do not have to create the directory in advance.
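
The precedence between `workdir` and `docker_workdir` can be summarized in a few lines of Python. This is a sketch of the rules described in this section, not SoS's actual implementation; the function `workdir_option` and the `_UNSET` sentinel are names invented for the illustration.

```python
import os
import posixpath

_UNSET = object()  # sentinel: distinguishes "not given" from an explicit None


def workdir_option(workdir=None, docker_workdir=_UNSET, cwd=None):
    """Return the `-w` arguments for `docker run` as a list (possibly empty).

    Hypothetical helper for illustration; not part of the SoS API.
    """
    cwd = cwd or os.getcwd()
    if docker_workdir is _UNSET:
        # Default: mirror the host working directory, adjusted by `workdir`.
        path = posixpath.normpath(posixpath.join(cwd, workdir or "."))
        return [f"-w={path}"]
    if docker_workdir is None:
        # No -w option: the WORKDIR defined by the image takes effect.
        return []
    if not posixpath.isabs(docker_workdir):
        raise ValueError("docker_workdir must be an absolute path")
    # An explicit docker_workdir overrides workdir.
    return [f"-w={docker_workdir}"]


host = "/Users/bpeng1/sos/sos-docs/src/tutorials"
print(workdir_option(cwd=host))
print(workdir_option(workdir="..", cwd=host))
print(workdir_option(docker_workdir=None, cwd=host))
print(workdir_option(docker_workdir="/home/random_user", cwd=host))
```

A sentinel is used because `None` is itself a meaningful value here: it means "pass no `-w` option", which is different from not specifying `docker_workdir` at all.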

### Sharing of input and output files (`volumes`)

Because the working directory of the docker image is set by default to the current working directory, you can apply a command inside a docker image to files in the current working directory, and create files in it as well.

sh: docker_image='ubuntu:14.04'
  wc -l SoS_Docker_Guide.ipynb > docker_wc.txt
  
sh:
  cat docker_wc.txt

This works because SoS automatically shares the home directory (`/Users` on Mac and `/home` on Linux) and the current working directory (if it is not under `$HOME`) of the host system with the docker image. Because the docker image can only "see" file systems shared by the command `docker run`, your script will fail in the following scenarios:

* Your input or output files are on a separate file system (not under `/home` or `$(pwd)`).
* You cannot share the home or current working directory with the docker image because of possible side effects.

For example, if your script writes something to a mobile hard drive, the script could execute successfully on the host system

In [14]:
sh:
  wc -l SoS_Docker_Guide.ipynb > /Volumes/Mobile

but fail in a docker image because the image cannot see the `/Volumes` file system

In [13]:
sh: docker_image='ubuntu:14.04'
  wc -l SoS_Docker_Guide.ipynb > /Volumes/Mobile

/var/lib/sos/docker_run_39643.sh: 1: /var/lib/sos/docker_run_39643.sh: cannot create /Volumes/Mobile: Directory nonexistent


Executing script in docker returns an error (exitcode=2).
The script has been saved to /Users/bpeng1/sos/sos-docs/src/tutorials/.sos/docker_run_39643.sh. To reproduce the error please run:
``docker run --rm   -v /Users:/Users -v /tmp:/tmp -v /Users/bpeng1/sos/sos-docs/src/tutorials/.sos/docker_run_39643.sh:/var/lib/sos/docker_run_39643.sh    -t -P -w=/Users/bpeng1/sos/sos-docs/src/tutorials -u 1985961928:895809667    ubuntu:14.04 /bin/sh /var/lib/sos/docker_run_39643.sh``


The problem can be solved by specifying `/Volumes` in option `volumes`. This parameter

1. By default maps `/Users` to `/Users` (Mac) or `/home` to `/home` (Linux).
2. If specified, shares only the user-specified file systems. In this case you can specify different path names for the host and docker file systems (e.g. `/Users:/home`).

The current working directory is always mapped if it is not under the default or specified directories.
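
These rules can be sketched in Python as follows. The function `resolve_volumes` and its `platform` argument are hypothetical and only mirror the description above; in particular, the sketch compares the working directory against the host side of each mapping, which is an assumption about how renamed mappings are treated.

```python
import posixpath


def resolve_volumes(volumes=None, cwd="/", platform="linux"):
    """Return a list of `host:container` mappings following the rules above.

    Hypothetical helper for illustration; not part of the SoS API.
    """
    if volumes is None:
        # Rule 1: share the home file system by default.
        volumes = ["/Users"] if platform == "darwin" else ["/home"]
    elif isinstance(volumes, str):
        volumes = [volumes]

    mappings = []
    for volume in volumes:
        # Rule 2: "host" alone means "host:host"; "host:container" renames it.
        host, _, container = volume.partition(":")
        mappings.append((host, container or host))

    def is_under(path, parent):
        parent = posixpath.normpath(parent)
        path = posixpath.normpath(path)
        return path == parent or path.startswith(parent.rstrip("/") + "/")

    # Assumption: the working directory counts as shared when it lies under
    # the host side of any mapping; otherwise it is mapped to itself.
    if not any(is_under(cwd, host) for host, _ in mappings):
        mappings.append((cwd, cwd))
    return [f"{host}:{container}" for host, container in mappings]


cwd = "/Users/bpeng1/sos/sos-docs/src/tutorials"
print(resolve_volumes(cwd=cwd, platform="darwin"))
print(resolve_volumes("/Volumes", cwd=cwd, platform="darwin"))
```

Each returned string can be passed to `docker run` after a `-v` flag.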

<div class="alert alert-info">
<strong>Note:</strong>
If docker is implemented as a virtual machine, the file systems that are available to docker will be limited by the shared file systems of the virtual machine. 
</div>

Now, if you would like to read input files from or write output files to another file system, you can add it to option `volumes`:

In [6]:
sh: docker_image='ubuntu:14.04', volumes='/Volumes'
  wc -l SoS_Docker_Guide.ipynb > /Volumes/Mobile

sh:
  cat /Volumes/Mobile

1236 SoS_Docker_Guide.ipynb


In this case SoS shares only `/Volumes` to `/Volumes` and `$(pwd)` (current working directory) to `$(pwd)`. `$HOME` is no longer shared, which can be a good thing because sharing the home directory with the docker image can cause unexpected conflicts between the docker and host systems. For example, your `.R` directory, when mapped to a docker image, might change the behavior of the `R` command inside docker.

Finally, if you have to share your home directory with the docker image but do not want to expose your host settings to the image, you can map your local volumes under different names. For example, the following script maps the home directory as `/input` and `/Volumes` as `/output` in the docker image.

In [2]:
sh: docker_image='ubuntu:14.04', volumes=['~:/input', '/Volumes:/output']
  wc -l /input/sos/sos-docs/src/tutorials/SoS_Docker_Guide.ipynb > /output/Mobile

sh:
  cat /Volumes/Mobile

1249 /input/sos/sos-docs/src/tutorials/SoS_Docker_Guide.ipynb


### Customize user and group ID (`user`)

In [None]:
TODO: explain uid and gid. Use of `--user uid` and `--user uid:gid`.
Use `--user None` to stop sending host uid/gid to docker.

In [None]:
docker_build: tag='test_docker_workdir'
  FROM ubuntu:14.04
  USER /blah

### Docker images with `entry_point`

Some docker images have an `entry_point`, which determines the command that is executed when the image is run. If we run the script directly, our "command" (e.g. `ruby /var/lib/sos/docker_run_30258.rb`) will be appended to the `entry_point` and will not be executed properly.

For example, docker image [`dceoy/gatk`](https://hub.docker.com/r/dceoy/gatk/~/dockerfile/) has an entry point

```
["java", "-jar", "/usr/local/src/gatk/build/libs/gatk.jar"]
```

and does not accept any additional interpreter. What we really need to do is to append "arguments" to this pre-specified command.

Recall that the action `script` does not have a default interpreter and that the option `args` can be used to construct a command line. We can therefore use this docker image as follows:

In [4]:
script: args = 'Print Reads -h', docker_image = 'dceoy/gatk'

INFO: docker pull dceoy/gatk


USAGE:  <program name> [-h]

Available Programs:
--------------------------------------------------------------------------------------
Base Calling:                                    Tools that process sequencing machine data, e.g. Illumina base calls, and detect sequencing level attributes, e.g. adapters
    CheckIlluminaDirectory (Picard)              Asserts the validity for specified Illumina basecalling data.  
    CollectIlluminaBasecallingMetrics (Picard)   Collects Illumina Basecalling metrics for a sequencing run.  
    CollectIlluminaLaneMetrics (Picard)          Collects Illumina lane metrics for the given BaseCalling analysis directory.  
    ExtractIlluminaBarcodes (Picard)             Tool determines the barcode for each read in an Illumina lane.  
    IlluminaBasecallsToFastq (Picard)            Generate FASTQ file(s) from Illumina basecall read d

    CollectRnaSeqMetrics (Picard)                Produces RNA alignment metrics for a SAM or BAM file.  
    CollectRrbsMetrics (Picard)                  <b>Collects metrics from reduced representation bisulfite sequencing (Rrbs) data.</b>  
    CollectSequencingArtifactMetrics (Picard)    Collect metrics to quantify single-base sequencing artifacts.  
    CollectTargetedPcrMetrics (Picard)           Calculate PCR-related metrics from targeted sequencing data. 
    CollectVariantCallingMetrics (Picard)        Collects per-sample and aggregate (spanning all samples) metrics from the provided VCF file
    CollectWgsMetrics (Picard)                   Collect metrics about coverage and performance of whole genome sequencing (WGS) experiments.
    CollectWgsMetricsWithNonZeroCoverage (Picard)(BETA Tool) (Experimental) Collect metrics about coverage and performance of whole genome sequencing (W

    DownsampleSam (Picard)                       Downsample a SAM or BAM file.
    ExtractOriginalAlignmentRecordsByNameSpark   (BETA Tool) Subsets reads by name
    FastqToSam (Picard)                          Converts a FASTQ file to an unaligned BAM or SAM file
    FilterSamReads (Picard)                      Subsets reads from a SAM or BAM file by applying one of several filters.
    FixMateInformation (Picard)                  Verify mate-pair information between mates and fix if needed.
    FixMisencodedBaseQualityReads                Fix Illumina base quality scores in a SAM/BAM/CRAM file
    GatherBamFiles (Picard)                      Concatenate efficiently BAM files that resulted from a scattered parallel analysis
    LeftAlignIndels                              Left-aligns indels from reads in a SAM/BAM/CRAM file
    MarkDuplicates (Picard)              

    AnnotateVcfWithExpectedAlleleFraction        (Internal) Annotate a vcf with expected allele fractions in pooled sequencing
    CalculateGenotypePosteriors                  Calculate genotype posterior probabilities given family and/or known population genotypes
    CalculateMixingFractions                     (Internal) Calculate proportions of different samples in a pooled bam
    Concordance                                  (BETA Tool) Evaluate concordance of an input VCF against a validated truth VCF
    CountFalsePositives                          (BETA Tool) Count PASS variants
    CountVariants                                Counts variant records in a VCF file, regardless of filter status.
    CountVariantsSpark                           (BETA Tool) CountVariants on Spark
    FindMendelianViolations (Picard)             Finds mendelian violations of all 

Executing script in docker returns an error (exitcode=2).
The script has been saved to /Users/bpeng1/sos/sos-docs/src/tutorials/.sos/docker_run_30258.sh. To reproduce the error please run:
``docker run --rm   -v /Users:/Users -v /tmp:/tmp -v /Users/bpeng1/sos/sos-docs/src/tutorials/.sos/docker_run_30258.sh:/var/lib/sos/docker_run_30258.sh    -t -P -w=/Users/bpeng1/sos/sos-docs/src/tutorials -u 1985961928:895809667    dceoy/gatk  Print Reads -h``


which essentially passes `Print Reads -h` to the image and executes the command
```
java -jar /usr/local/src/gatk/build/libs/gatk.jar Print Reads -h
```
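
The way docker combines an entry point with the arguments given after the image name can be mimicked in a few lines of Python. The sketch below only illustrates the concatenation; `effective_command` is a hypothetical helper, and the entry point is the one quoted above for `dceoy/gatk`.

```python
import shlex

# ENTRYPOINT of the dceoy/gatk image, as quoted above.
ENTRYPOINT = ["java", "-jar", "/usr/local/src/gatk/build/libs/gatk.jar"]


def effective_command(entrypoint, container_args):
    """Command executed in the container: the entry point followed by the
    arguments that appear after the image name on the `docker run` line."""
    return list(entrypoint) + shlex.split(container_args)


# With an interpreter and a script appended, the tool receives both as
# (meaningless) arguments.
print(shlex.join(effective_command(ENTRYPOINT, "ruby /var/lib/sos/docker_run_30258.rb")))

# With `args` only, the tool receives exactly the arguments we intend.
print(shlex.join(effective_command(ENTRYPOINT, "Print Reads -h")))
```

This is why the interpreter has to be dropped for such images: anything placed after the image name becomes an argument of the entry point rather than a command of its own.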

If the command line is long, you can use another trick: put `{script}` in `args` so that the script of the action is used as the arguments. For example, the aforementioned command can be specified as

In [5]:
script: args = '{script}', docker_image = 'dceoy/gatk'
  Print Reads -h

The output and the error message are the same as those of the previous cell.


## Limitations

* VirtualBox virtual machines do not support symbolic links, so running `ln -s` inside a docker machine on a Mac causes a confusing `Read-only file system` error.
* Killing a sos task or sos process does not terminate scripts that are being executed by the docker daemon.