# SoS Docker Guide

## General introduction

### What is docker and why is it helpful

This is a big question, but in essence you can think of docker containers as virtual machines with applications but without the bulky OS part, or as applications with stripped-down OSes. Docker containers are much more lightweight than virtual machines because all containers share the kernel of the host OS, and related images (e.g. different applications derived from the same CentOS or Ubuntu base) share the same base layers. Please refer to the [docker website](https://www.docker.com/) for details about docker. I have found it helpful to watch a few YouTube videos on docker.

Docker is very helpful in building (bioinformatics) workflows because

1. Applications are encapsulated in docker containers so that they do not interfere with the underlying OS or with other applications. For example, we can run a workflow with applications that are based on different versions of Python 2 and Python 3 without having to install them locally and call the correct version of Python, because each application uses the specific version of Python and the required libraries and tools inside its own container.

2. Workflows will be more stable and reproducible because, unlike a local installation of Python that can be affected by other software and by upgrades of Python, docker containers are created from fixed images that do not change.

3. The same docker containers can be executed on different OSes (e.g. various versions of Linux, macOS, etc.), so a workflow built on a Mac workstation can be executed in a cluster environment.

There is of course some complexity in the use of docker, but SoS makes it very easy to use docker in your workflows.

### Installing and configuring docker

Docker is relatively new and is evolving very fast, so it is important to install the latest version from the [docker website](https://www.docker.com/). The website provides detailed step-by-step instructions, and you should have no problem installing docker on your machine.

After installation, you should be able to start a docker terminal and run the command

```bash
$ docker run hello-world
```

as suggested by the documentation. Depending on the version of docker (e.g. docker under Windows), docker might run inside a virtual machine. It is very important to understand that **the configuration (e.g. RAM, CPU) of the docker machine is different from that of the host machine**, so your docker machine might be restricted to, for example, 1 CPU and 1G of RAM, which is insufficient for any serious work. You will most likely need to reconfigure your docker virtual machine (e.g. from the VirtualBox app, locate a machine named `default`).

## Running a script inside docker

### How SoS works with docker

SoS executes scripts inside docker by calling the command `docker run` with appropriate parameters. If you do not have ruby installed locally and would like to run a ruby script, you can execute it inside a `ruby` container.

In [1]:
%run
ruby: docker_image='ruby'
    line1 = "Cats are smarter than dogs";
    line2 = "Dogs also like meat";

    if ( line1 =~ /Cats(.*)/ )
      puts "Line1 contains Cats"
    end
    if ( line2 =~ /Cats(.*)/ )
      puts "Line2 contains Cats"
    end

Line1 contains Cats


The actual `docker run` command executed by SoS can be shown by passing option `-v3` to the `%run` command, and looks similar to

```
docker run --rm  
    -v $(pwd):$(pwd)
    -v /tmp/path/to/docker_run_30258.rb:/var/lib/sos/docker_run_30258.rb
    -t -P 
    -w=/Users/bpeng1/sos/sos-docs/src/tutorials
    -u 12345:54321    ruby
    ruby /var/lib/sos/docker_run_30258.rb
```

Basically, SoS downloads a docker image called `ruby` and runs the command `docker run` to execute the specified script, with the following options

* `--rm` Automatically removes the container when it exits
* `-v $(pwd):$(pwd)` maps the current working directory into the container so that it can be accessed from within the container
* `-v /tmp/path/to/docker_run_30258.rb:/var/lib/sos/docker_run_30258.rb` maps a temporary script (e.g. `/Users/bpeng1/sos/sos-docs/src/tutorials/tmp2zviq3qh/docker_run_30258.rb`) into the container.
* `-t` Allocates a pseudo-tty
* `-P` Publishes all exposed ports to the host interfaces
* `-w=/Users/bpeng1/sos/sos-docs/src/tutorials` Sets the working directory to the current working directory
* `-u 12345:54321` Uses the host user ID and group ID inside docker so that files created by docker (on shared volumes) are accessible from outside of docker.
* `ruby` name of the docker image
* `ruby /var/lib/sos/docker_run_30258.rb` command that executes the script.
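To make the role of each option concrete, the following Python sketch assembles an argument list with the same shape as the command shown above. It is only an illustration: the helper `build_docker_run` and its parameters are hypothetical names and do not reflect how SoS builds the command internally.

```python
import os
import shlex


def build_docker_run(image, interpreter, host_script, workdir=None, uid=None, gid=None):
    """Assemble a `docker run` argument list that mirrors the options listed above.

    Hypothetical helper for illustration; not part of SoS.
    """
    workdir = workdir or os.getcwd()
    # Fall back to the ids of the current host user (Unix only).
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    # The temporary script is mounted under /var/lib/sos inside the container.
    container_script = "/var/lib/sos/" + os.path.basename(host_script)
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:{workdir}",               # share the working directory
        "-v", f"{host_script}:{container_script}",  # share the temporary script
        "-t", "-P",
        f"-w={workdir}",                            # start in the working directory
        "-u", f"{uid}:{gid}",                       # run as the host user
        image,
        interpreter, container_script,
    ]


cmd = build_docker_run(
    image="ruby",
    interpreter="ruby",
    host_script="/tmp/path/to/docker_run_30258.rb",
    workdir="/Users/bpeng1/sos/sos-docs/src/tutorials",
    uid=12345,
    gid=54321,
)
print(shlex.join(cmd))
```

Keeping the command as a list (rather than one string) avoids quoting problems when paths contain spaces; `shlex.join` is used only for display.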

The details of these options can be found in the [docker run manual](https://docs.docker.com/engine/reference/run/). They are chosen by default to work in the majority of scenarios but can fail for some docker images, in which case you will need to use SoS action parameters to customize how the images are executed. These parameters include general [action parameters](https://vatlab.github.io/sos-docs/doc/documentation/Targets_and_Actions.html#Action-options-12) and [parameters that are specific to `docker_image`](https://vatlab.github.io/sos-docs/doc/documentation/Targets_and_Actions.html#docker_image).

### Building docker images (action `docker_build`)

Building a docker image is usually done outside of SoS if you are maintaining a collection of docker images to be shared by your workflows, your group, or everyone. However, if you need to create a docker image on the fly or would like to embed the Dockerfile inside an SoS script, you can use the `docker_build` action to build a docker image.

For example, you can build a simple image

In [2]:
docker_build: tag='test_docker'
  FROM ubuntu:14.04


0

and use the image

In [3]:
sh: docker_image='test_docker'
  ls /usr

bin  games  include  lib  local  sbin  share  src


This tutorial will use the `docker_build` action to build a few simple docker images to demonstrate the use of various options.

### Customized working directory (`workdir` and `docker_workdir`)

SoS by default sets the current working directory of the docker image to the working directory of the host system, essentially adding `-w $(pwd)` to the command line. For example, with the following docker image, the `pwd` of the script is the current working directory on the host machine.

In [4]:
sh: docker_image='ubuntu:14.04'
  echo `pwd`

/Users/bpeng1/sos/sos-docs/src/tutorials


Since the action option `workdir` changes the working directory of the script, you can use this option to change the working directory inside the docker image as well. For example, in the following example SoS changes the current working directory to the parent directory before executing `docker run` there.

In [5]:
sh: docker_image='ubuntu:14.04', workdir='..'
  echo `pwd`

/Users/bpeng1/sos/sos-docs/src


This default behavior is convenient when you use commands in docker images to process input files on your host machine but it has two caveats:

1. The docker image might have its own `WORKDIR` for the command to work. For example, a docker image can provide an application that is not in standard `$PATH` so it can only be executed in a specified `WORKDIR`.
2. You might need to specify a working directory inside of docker that does not exist in the host machine.

Option `docker_workdir`, if specified, overrides `workdir` and allows the use of default or customized working directory inside of docker images. When `docker_workdir` is set to `None`, no `-w` option will be passed to the docker image and the default `WORKDIR` will be used. Otherwise an absolute path inside the docker image can be specified.

For example, the following customized docker image has its `WORKDIR` set to `/usr`. Its working directory is set to the host working directory by default, to `/usr` with `docker_workdir=None`, and to `/home/random_user` with `docker_workdir='/home/random_user'`.

In [6]:
docker_build: tag='test_docker_workdir'
  FROM ubuntu:14.04
  WORKDIR /usr

sh: docker_image='test_docker_workdir'
  echo `pwd`
  
sh: docker_image='test_docker_workdir', docker_workdir=None
  echo `pwd`
  
sh: docker_image='test_docker_workdir', docker_workdir='/home/random_user'
  echo `pwd`

/Users/bpeng1/sos/sos-docs/src/tutorials
/usr
/home/random_user


Note that the directory refers to the docker file system, so it does not have to exist on the host system. Docker also creates `docker_workdir` if it does not exist, so you do not have to create the directory in advance.
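The precedence between `workdir` and `docker_workdir` described in this section can be summarized with a small Python sketch. The function `workdir_args` and the sentinel `_UNSET` are hypothetical names used only for illustration, and the logic is a reading of the rules above rather than SoS's actual implementation.

```python
import os
import posixpath

_UNSET = object()  # distinguishes "option not given" from an explicit None


def workdir_args(workdir=None, docker_workdir=_UNSET, cwd=None):
    """Return the `-w` arguments implied by `workdir` and `docker_workdir`."""
    cwd = cwd or os.getcwd()
    if docker_workdir is None:
        # No -w option: the WORKDIR of the image is used.
        return []
    if docker_workdir is not _UNSET:
        # An explicit path inside the container overrides `workdir`.
        return [f"-w={docker_workdir}"]
    # Default: use the (possibly changed) host working directory.
    host_dir = posixpath.normpath(posixpath.join(cwd, workdir)) if workdir else cwd
    return [f"-w={host_dir}"]


cwd = "/Users/bpeng1/sos/sos-docs/src/tutorials"
print(workdir_args(cwd=cwd))
print(workdir_args(workdir="..", cwd=cwd))
print(workdir_args(docker_workdir=None, cwd=cwd))
print(workdir_args(docker_workdir="/home/random_user", cwd=cwd))
```

A sentinel is needed because `None` is itself a meaningful value for `docker_workdir`.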

### Sharing of input and output files (`volumes`)

Because the working directory of the docker image is set by default to the current working directory, you can apply a command inside a docker image to files in the current working directory, and create files in it as well.

In [7]:
sh: docker_image='ubuntu:14.04'
  wc -l SoS_Docker_Guide.ipynb > docker_wc.txt
  
sh:
  cat docker_wc.txt

731 SoS_Docker_Guide.ipynb


This works because SoS automatically shares the current working directory of the host system with the docker image. Because the docker image can only "see" file systems shared by the command `docker run`, your script will fail if your input or output files are not under the current working directory.

For example, the following script writes something to a mobile hard drive (`/Volumes/Mobile` under Mac)

In [8]:
sh:
  wc -l SoS_Docker_Guide.ipynb > /Volumes/Mobile

The same script fails to execute in a docker image because the image cannot see the `/Volumes` file system

In [9]:
%sandbox --expect-error
sh: docker_image='ubuntu:14.04'
  wc -l SoS_Docker_Guide.ipynb > /Volumes/Mobile

/var/lib/sos/docker_run_37713.sh: 1: /var/lib/sos/docker_run_37713.sh: cannot create /Volumes/Mobile: Directory nonexistent


Executing script in docker returns an error (exitcode=2).
The script has been saved to /var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpkdejiko3/.sos/docker_run_37713.sh. To reproduce the error please run:
``docker run --rm   -v /private/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpkdejiko3:/private/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpkdejiko3 -v /var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpkdejiko3/.sos/docker_run_37713.sh:/var/lib/sos/docker_run_37713.sh    -t -P -w=/private/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpkdejiko3 -u 1985961928:895809667    ubuntu:14.04 /bin/sh /var/lib/sos/docker_run_37713.sh``


The problem can be solved by specifying additional shared file systems through parameter `volumes`. This parameter accepts one volume (a string) or a list of volumes (a list of strings), each in the format of

* A single path (e.g. `/Users`), which is shared with the docker image under the same name (e.g. `/Users:/Users`).
* A full volume specification `[host-src:]container-dest[:<options>]`, in which case host and container directories can have different names (e.g. `/Users:/home`).

A special rule here is that the current working directory will not be mapped separately if it is under one of the specified volumes. That is to say, if the current working directory is `/Users/bpeng1/project` and option `volumes='/Users:/home'` is specified, the current working directory will be implicitly mapped to `/home/bpeng1/project`.
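The two accepted formats and the special rule for the current working directory can be sketched in Python as follows. The helpers `normalize_volumes` and `map_cwd` are hypothetical names for illustration only; the sketch ignores `~` expansion and volume options, and is not SoS's actual code.

```python
def normalize_volumes(volumes):
    """Turn a string or a list of volume specs into (host, container) pairs."""
    if isinstance(volumes, str):
        volumes = [volumes]
    pairs = []
    for spec in volumes:
        parts = spec.split(":")
        host = parts[0]
        # A single path is shared under the same name inside the container.
        container = parts[1] if len(parts) > 1 else host
        pairs.append((host, container))
    return pairs


def map_cwd(cwd, volumes):
    """Return the container path of `cwd`, or None if no volume covers it."""
    for host, container in normalize_volumes(volumes):
        host = host.rstrip("/")
        if cwd == host or cwd.startswith(host + "/"):
            # Keep the part of cwd below the shared directory.
            return container.rstrip("/") + cwd[len(host):]
    return None


print(normalize_volumes("/Users"))
print(map_cwd("/Users/bpeng1/project", "/Users:/home"))
print(map_cwd("/Users/bpeng1/project", "/Volumes"))
```

When `map_cwd` returns `None`, the working directory is not covered by any volume and would have to be shared separately.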

If you would like to read input files from, or write output files to, another volume, you can add its path to option `volumes`

In [10]:
sh: docker_image='ubuntu:14.04', volumes='/Volumes'
  wc -l SoS_Docker_Guide.ipynb > /Volumes/Mobile

sh:
  cat /Volumes/Mobile

731 SoS_Docker_Guide.ipynb


As another commonly used technique, some users prefer to use "standard directories" as the input and output directories of a script so that the scripts are more portable. For example, the following script maps the source directory to `/input` and the destination directory to `/output`, and uses `/input` and `/output` inside the docker image:

In [11]:
sh: docker_image='ubuntu:14.04',
  volumes=['~/sos/sos-docs/src/tutorials:/input', '/Volumes:/output']
  wc -l /input/SoS_Docker_Guide.ipynb > /output/Mobile

sh:
  cat /Volumes/Mobile

731 /input/SoS_Docker_Guide.ipynb


Finally, as a word of caution, although it is tempting to share common directories (e.g. `/home`, `/Users`, `/Volumes` etc) to the docker image, sharing of extra directories can cause unexpected problems. For example, a docker image might contain a useful `/Users` directory and sharing host `/Users` will override the directory inside docker. 

Furthermore, sharing `$HOME` will expose a lot of user settings (e.g. settings under `~/.R`, `~/.sos`) to the docker image and might affect how the docker image runs. If you really need to expose a home directory to a docker image, you might want to expose it as a different directory inside of docker. For example, the following script shares the home directory as `/data` inside docker so that it does not interfere with the home directory of the docker image.

In [12]:
sh: docker_image='ubuntu:14.04', volumes='~:/data'
  ls ~ | wc -l
  ls /data | wc -l

21
46


### Customize user and group ID (`user`)

By default SoS passes option `--user $(id -u):$(id -g)` to the command `docker run` so that the command is executed as the same user as on the host machine. This makes sure that the docker image has read/write permission to shared volumes and that the files written by the docker image are readable on the host machine.
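A minimal Python sketch of how the `user` option could translate into `docker run` arguments is given below. Only the default behavior is described in this guide; the handling of `user=None` (use the user defined by the image, i.e. pass no `--user` option) and of an explicit user name is an assumption, and `user_args` is a hypothetical helper rather than an SoS function.

```python
import os

_DEFAULT = object()  # distinguishes "option not given" from an explicit None


def user_args(user=_DEFAULT):
    """Return the user-related arguments for `docker run`."""
    if user is None:
        # Assumption: None means "keep the user defined by the image".
        return []
    if user is _DEFAULT:
        # Default: run as the host user so shared files stay accessible (Unix only).
        return ["-u", f"{os.getuid()}:{os.getgid()}"]
    # Assumption: any other value is passed through as the user to run as.
    return ["-u", str(user)]


print(user_args())
print(user_args(None))
print(user_args("blah"))
```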

FIXME: example of using image user (`user=None`)
FIXME: example of using another user (`user=blah`)

In [13]:
docker_build: tag='test_docker_workdir'
  FROM ubuntu:14.04
  USER blah

0

### Docker images with `ENTRYPOINT`

Some docker images have an entry point, which determines the command that will be executed when the image is run. When such images are executed, parameters passed from the command line are appended to `ENTRYPOINT`, so our usual way of specifying an interpreter (e.g. `ruby`) and a script will not work. If we run the script directly, our "command" (e.g. `ruby /var/lib/sos/docker_run_30258.rb`) will be appended to the entry point and will not be executed properly. Examples of such images include [`dceoy/gatk`](https://hub.docker.com/r/dceoy/gatk/~/dockerfile/), which has an entry point

```
["java", "-jar", "/usr/local/src/gatk/build/libs/gatk.jar"]
```

and does not accept any additional interpreter. What we really need to do is to append "arguments" to this pre-specified command.

For example, the `test_docker_ls` image has an `ENTRYPOINT` with command `ls`.

In [14]:
docker_build: tag='test_docker_ls'
  FROM ubuntu:14.04
  ENTRYPOINT ["ls"]

0

The image is expected to be executed directly with optional parameters and without an interpreter (e.g. `docker run test_docker_ls`).

Because action `script` does not have a default interpreter, and option `args` can be used to construct a command line, we can use docker images with `ENTRYPOINT` in the format of

In [15]:
script: args = '-l SoS_Docker_Guide.ipynb', docker_image = 'test_docker_ls'

-rw-r--r-- 1 1985961928 895809667 24378 May 30 03:22 SoS_Docker_Guide.ipynb


which essentially passes `-l SoS_Docker_Guide.ipynb` to the image and executes command 
```
ls -l SoS_Docker_Guide.ipynb
```

If the command line is long, you can use another trick: put `{script}` in `args` so that the content of the script of the action is used as the arguments. For example, the aforementioned command can be specified as

In [16]:
script: args = '{script}', docker_image = 'test_docker_ls'
  -l SoS_Docker_Guide.ipynb

-rw-r--r-- 1 1985961928 895809667 24378 May 30 03:22 SoS_Docker_Guide.ipynb
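The way arguments are combined with an exec-form `ENTRYPOINT` can be illustrated with a short Python sketch. The helper `effective_command` is a hypothetical name, and the substitution of `{script}` by the text of the script is an assumption based on the two examples above, not a description of SoS internals.

```python
import shlex


def effective_command(entrypoint, args, script=""):
    """Return the argv a container would run for an exec-form ENTRYPOINT."""
    # Assumption: `{script}` in `args` is replaced by the text of the script.
    expanded = args.replace("{script}", script.strip())
    # Docker appends the command-line arguments to the entry point.
    return list(entrypoint) + shlex.split(expanded)


# Arguments given directly through `args`.
print(effective_command(["ls"], "-l SoS_Docker_Guide.ipynb"))

# Arguments taken from the script body through `{script}`.
print(effective_command(["ls"], "{script}", "  -l SoS_Docker_Guide.ipynb\n"))

# Why an interpreter plus script does not work with such images.
print(effective_command(["ls"], "ruby /var/lib/sos/docker_run_30258.rb"))
```

The last call shows the failure mode described at the start of this section: the interpreter and script path end up as arguments to `ls` instead of being executed.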


## Limitations

* VirtualBox virtual machines do not support symbolic links, so running `ln -s` inside a docker machine under Mac will cause a strange error message `Read-only file system`.
* Killing an SoS task or SoS process will not terminate scripts that are executed by the docker daemon.