# Optimize your Docker Infrastructure with Python

## PyData NYC

November 11, 2015

Ryan J. O'Neil  
<ryanjoneil@gmail.com>  

### Intro

By Day: Lead Engineer @ Yhat, Inc.  

Formerly:
* Simulation, modeling, optimization @ MITRE
* Data Journo @ The Washington Post

By Night: PhD Candidate in SEOR @ George Mason University

Research:
* Combinatorial optimization
* Scheduling problems
* Cutting & packing problems

### Motivation

Consider the DevOps Engineer.

#### It's a noble group...

![](images/scotty.jpg)

#### ...with grave responsibilities.

One of the important functions of DevOps is the creation of environments for:

* Software development
* Testing & quality assurance
* Running operational systems
* Whatever else you might need a computing environment for

#### There's even a Venn diagram about DevOps.

And we all know how much we like Venn diagrams.

![](images/devops-venn.svg)

#### A typical DevOps function

Say you work on _Yet Another Enterprise Java Application (TM)_. YAEJA generates lots of \$\$\$\$ for _Yet Another Enterprise Software Company_ and it keeps you gainfully employed!

Say you need to upgrade portions of the system YAEJA runs _(and depends)_ on.

You don't want to just upgrade the production instances without testing first. So you need two environments from your DevOps person.

One for running the production instance:

|     | Command                                                        |
|-----|----------------------------------------------------------------|
| $A$ | Install the Java compiler and runtime environment.             |
| $B$ | Download a set of external dependencies.                       |
| $C$ | Set up an EnterpriseDB (TM) schema and populate it with data.  |

And another for testing it with the new version of the utility:

|     | Command                                                        |
|-----|----------------------------------------------------------------|
| $A$ | Install the Java compiler and runtime environment.             |
| $B$ | Download a set of external dependencies.                       |
| $C$ | Set up an EnterpriseDB (TM) schema and populate it with data.  |
| $D$ | Update the underlying system utility.                          |

If _Yet Another Enterprise Java Application (TM)_ integrates with a number of optional third-party applications, you might need an additional test environment for each one.

If those interact in interesting ways, you may need test environments for different combinations of third-party integrations.

#### Looks like your DevOps person will be working all weekend

![](images/saturday.png)

#### In the old days...

At a big software shop, DevOps may be responsible for the continual setup and teardown of dozens (or _hundreds_) of system configurations for a single software project.

This used to be done with physical hardware on _(sometimes)_ fresh operating system installs.

Most medium-to-large software shops had an air-conditioned room that looked like this.

![](images/cable-spaghetti.jpg)

#### Environmental setup was often pretty manual

If you had to reproduce an environment, you had a few options:

* Start with a fresh install of the operating system


* Save the results of your setup to a CD and load that onto a box


* Hope that uninstalling and reinstalling the relevant components is good enough

    + A _lot_ of people did this
    + It's probably not good enough
    + Most software leaves behind relics
        - Logs...
        - Data files...
        - Configuration...

#### Nowadays...

![](images/cloud-docker.png)

### What's a container?

Docker is so hot right now.

Containers are lightweight virtualization. They make it seem like a process is in its own operating system on its own hardware, without loading up heavy stuff like a kernel.

Container architectures have been around since `chroot` jails in V7 Unix.


They're not exactly _new_...

But now they're so convenient they feel _(and are raising venture capital)_ that way!

#### A tiny bit of history

* 1979: `chroot` jails added to Version 7 Unix at Bell Labs
    + Last version of Unix before it was commercialized by AT&T
    + Ran on a DEC PDP-11 minicomputer

* 1982: Bill Joy ports `chroot` to BSD

* 2005: Sun releases Solaris Containers
    + Zones provide fully isolated virtual servers on a single host

* 2007: Initial implementation of `cgroups` by Google for Linux Kernel 2.6.24
    + Isolation of system resources

* 2013: **Namespace isolation** rounded out by user namespaces in Linux Kernel 3.8
    + Process IDs
    + Network interfaces, iptables, routing
    + Inter-Process Communication
    + etc...

#### Containers are about isolation...

Processes and system resources behave as if they are on their own computers.

##### Container 1

```
[ryan@localhost ~]$ docker run -it ubuntu:trusty /bin/bash
root@19867869f71d:/# echo spam and eggs > /ingredients.txt
root@19867869f71d:/# cat /ingredients.txt                         
spam and eggs
```

##### Container 2

```
[ryan@localhost ~]$ docker run ubuntu:trusty cat /ingredients.txt
cat: /ingredients.txt: No such file or directory
```

They have their own process spaces.

```
[ryan@localhost ~]$ docker run --cidfile=cid -it ubuntu:trusty
root@fc347d97db8c:/# echo $$
1
```

And they are convinced they have their own hardware resources.

```
[ryan@localhost ~]$ docker stats --no-stream=true $(cat cid)

CONTAINER
fc347d97db8c7d02b870eff3e2d1e92747e8c0fc1cb7f9b1c76bf534fcd21ba0

CPU %               MEM USAGE/LIMIT     MEM %               NET I/O
0.00%               524.3 kB/7.945 GB   0.01%               648 B/648 B
```

#### ...but containers are also about sharing.

And this is what we care about here.

Specifically, saving and retrieving the results of a computation from the Docker image cache.

Why?

Smart cache use == time saved building out environments!

![](images/scotty-approved.jpg)

### Docker cache mechanics

Maybe I should call this section UnionFS mechanics, but then Docker is so hot right now.

#### A Tale of Two Dockerfiles

Let's say I collect old or non-standard revision control systems.

Everyone needs a hobby.

I set up an environment with a few RCSs in it to play around with.

##### Dockerfile: rcs1

```
FROM ubuntu:trusty
RUN apt-get install -y bzr
RUN apt-get install -y cvs
```

#### Let's build out our non-standard RCS image!

```
[ryan@localhost dockerfiles]$ docker build --file=rcs1 --tag=rcs1 .
```

Step 0 would download the base Ubuntu image, but I've already got it.

```
Sending build context to Docker daemon 24.58 kB
Step 0 : FROM ubuntu:trusty
 ---> a5a467fddcb8
```

`bzr` requires a number of dependencies...

```
Step 1 : RUN apt-get install -y bzr
 ---> Running in 7679f0a1525e
Reading package lists...
Building dependency tree...
Reading state information...
The following extra packages will be installed:
  ca-certificates dbus gir1.2-glib-2.0 libapparmor1 libassuan0
  libdbus-glib-1-2 libgirepository-1.0-1 libglib2.0-0 libglib2.0-data libgmp10
  [...snip...]
```

As does `cvs`.

```
Step 2 : RUN apt-get install -y cvs
 ---> Running in 54229eef94d2
Reading package lists...
Building dependency tree...
Reading state information...
The following extra packages will be installed:
  krb5-locales libedit2 libgssapi-krb5-2 libk5crypto3 libkeyutils1 libkrb5-3
  libkrb5support0 libx11-6 libx11-data libxau6 libxcb1 libxdmcp6 libxext6
  [...snip...]
```

At the end of this I have a tagged image. I can easily use this to spin up new containers.

```
[ryan@localhost dockerfiles]$ docker run rcs1 bash -c "which cvs && which bzr"
/usr/bin/cvs
/usr/bin/bzr
```

#### Oh snap.

I found an old RCS repository from the 1980s I want to look at.

![](images/rcs-sccs.jpg)

I don't have `rcs` installed. I guess I'll have to build a new image.

##### Dockerfile: rcs2

```
FROM ubuntu:trusty
RUN apt-get install -y bzr
RUN apt-get install -y cvs
RUN apt-get install -y software-properties-common
RUN add-apt-repository "deb http://archive.ubuntu.com/ubuntu trusty universe"
RUN apt-get update
RUN apt-get install -y rcs
```

```
[ryan@localhost dockerfiles]$ docker build --file=rcs2 --tag=rcs2 .
```

The first few steps have already been run and can use the Docker cache!

```
Step 0 : FROM ubuntu:trusty
 ---> a5a467fddcb8

Step 1 : RUN apt-get install -y bzr
 ---> Using cache
 ---> ed8a8efd04cc

Step 2 : RUN apt-get install -y cvs
 ---> Using cache
 ---> 61fe05c24a19
```

After that it's business as usual.

```
Step 3 : RUN apt-get install -y software-properties-common
 ---> Running in 4adc2bb63a73
The following extra packages will be installed:
  iso-codes libasn1-8-heimdal libcurl3-gnutls libgssapi3-heimdal
  libhcrypto4-heimdal libheimbase1-heimdal libheimntlm0-heimdal
  [...snip...]

Step 4 : RUN add-apt-repository "deb http://archive.ubuntu.com/ubuntu trusty universe"
 ---> Running in 5a023435448c
 ---> 02439f270099
Removing intermediate container 5a023435448c

Step 5 : RUN apt-get update
 ---> Running in 449c25ee3537
Get:1 http://archive.ubuntu.com trusty-updates InRelease [64.4 kB]
[...snip...]

Step 6 : RUN apt-get install -y rcs
 ---> Running in 3574b94bbea3
The following NEW packages will be installed:
  rcs
```

#### Adding to my stable of random revision control.

_Also, why are they all three letters long?_

So now I have `rcs` in addition to `bzr` and `cvs`. Note that my `rcs2` image was built on top of the cached layers from my `rcs1` image, so I didn't have to redo any work for the parts they share.

```
[ryan@localhost dockerfiles]$ docker run rcs2 bash -c "which bzr && which cvs && which rcs"
/usr/bin/bzr
/usr/bin/cvs
/usr/bin/rcs
```

#### What happens if I rebuild the image again?

Omigosh everything is cached!

```
[ryan@localhost dockerfiles]$ docker build --file=rcs2 --tag=rcs2 .
Sending build context to Docker daemon 3.072 kB
Step 0 : FROM ubuntu:trusty
 ---> a5a467fddcb8
Step 1 : RUN apt-get install -y bzr
 ---> Using cache
 ---> ed8a8efd04cc
Step 2 : RUN apt-get install -y cvs
 ---> Using cache
 ---> 61fe05c24a19
Step 3 : RUN apt-get install -y software-properties-common
 ---> Using cache
 ---> 942207de6f7f
Step 4 : RUN add-apt-repository "deb http://archive.ubuntu.com/ubuntu trusty universe"
 ---> Using cache
 ---> 02439f270099
Step 5 : RUN apt-get update
 ---> Using cache
 ---> 220536dfc3ff
Step 6 : RUN apt-get install -y rcs
 ---> Using cache
 ---> 1a8cd22cb14d
Successfully built 1a8cd22cb14d
```

![](images/spock-logical-awesome.jpg)

#### What if I change the order of that Dockerfile?

I'd really rather have the commands to add `universe` at the top.

##### Dockerfile: rcs3

```
FROM ubuntu:trusty
RUN apt-get install -y software-properties-common
RUN add-apt-repository "deb http://archive.ubuntu.com/ubuntu trusty universe"
RUN apt-get update
RUN apt-get install -y bzr
RUN apt-get install -y cvs
RUN apt-get install -y rcs
```

#### What happens when I build an image with the commands reordered?

Nothing except the base image is cached!

```
[ryan@localhost dockerfiles]$ docker build --file=rcs3 --tag=rcs3 .
Sending build context to Docker daemon 4.096 kB

Step 0 : FROM ubuntu:trusty
 ---> a5a467fddcb8

Step 1 : RUN apt-get install -y software-properties-common
 ---> Running in 713e49f1f323
Reading package lists...
Building dependency tree...
Reading state information...
The following extra packages will be installed:
  ca-certificates gir1.2-glib-2.0 iso-codes krb5-locales libasn1-8-heimdal
  libcurl3-gnutls libdbus-glib-1-2 libgirepository-1.0-1 libglib2.0-0
  libglib2.0-data libgssapi-krb5-2 libgssapi3-heimdal libhcrypto4-heimdal
  libheimbase1-heimdal libheimntlm0-heimdal libhx509-5-heimdal libidn11
  libk5crypto3 libkeyutils1 libkrb5-26-heimdal libkrb5-3 libkrb5support0
  libldap-2.4-2 libroken18-heim
  [...etc...]
```

#### So Docker can use the cache when...

It has seen the exact same commands in _the same order_.

If it sees something new, it stops using the cache.

Certain commands, like `COPY` and `ADD`, also depend on file contents: if a copied file changes, the cache is invalidated from that step onward.
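Here's a toy model of that rule, as a minimal sketch: treat the cache as a set of command prefixes, and count a step as cached only when the entire sequence up to and including it has been built before. The `build` function and the `cache` set are made up for illustration (this is not how Docker actually stores layers), and the model ignores the base image and any file checksums. The command lists are the ones that appear in the build logs above.

```python
UNIVERSE = 'add-apt-repository "deb http://archive.ubuntu.com/ubuntu trusty universe"'

# Commands after FROM, in the order they appear in each build log.
DOCKERFILES = {
    "rcs1": [
        "apt-get install -y bzr",
        "apt-get install -y cvs",
    ],
    "rcs2": [
        "apt-get install -y bzr",
        "apt-get install -y cvs",
        "apt-get install -y software-properties-common",
        UNIVERSE,
        "apt-get update",
        "apt-get install -y rcs",
    ],
    "rcs3": [
        "apt-get install -y software-properties-common",
        UNIVERSE,
        "apt-get update",
        "apt-get install -y bzr",
        "apt-get install -y cvs",
        "apt-get install -y rcs",
    ],
}


def build(cache, commands):
    """Simulate a build and return how many steps came from the cache.

    Hypothetical model: a step is a cache hit only if the exact same
    commands, in the same order, were run up to and including that step.
    """
    hits = 0
    for i in range(1, len(commands) + 1):
        prefix = tuple(commands[:i])
        if prefix in cache:
            hits += 1
        else:
            cache.add(prefix)  # this step had to run, so now it is cached
    return hits


cache = set()
cached_steps = {}
for name in ["rcs1", "rcs2", "rcs3"]:
    cached_steps[name] = build(cache, DOCKERFILES[name])
    print(name, cached_steps[name], "of", len(DOCKERFILES[name]), "steps cached")
```

Built in the order `rcs1`, `rcs2`, `rcs3`, this reports 0, 2, and 0 cached steps, which lines up with the logs: `rcs2` reuses the two steps from `rcs1`, and the reordered `rcs3` reuses nothing.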

#### Docker caching examples

Building two images that share three commands:

![](images/fig01.png)

If we structure them like this instead, they share nothing:

![](images/fig02.png)

#### Once images diverge, they can never merge back together

This won't work:

![](images/fig03.png)

This is what would really happen:

![](images/fig04.png)

This would be better:

![](images/fig05.png)
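One way to picture this, sketched under my own assumptions: the images on a host form a tree of command prefixes, and every node in that tree is a layer somebody had to build. The command names `A` through `D` and both sets of images below are hypothetical stand-ins, not the contents of the figures. The sketch also assumes the commands can be safely reordered.

```python
def build_tree(images):
    """Merge each image's command list into a nested dict (a prefix tree)."""
    root = {}
    for commands in images.values():
        node = root
        for command in commands:
            node = node.setdefault(command, {})
    return root


def count_layers(node):
    """Every node in the tree is one layer that had to be built."""
    return sum(1 + count_layers(child) for child in node.values())


def render(node, depth=0):
    """Return the tree as indented lines, one command per line."""
    lines = []
    for command, child in node.items():
        lines.append("  " * depth + command)
        lines.extend(render(child, depth + 1))
    return lines


# Both images run C, but only after they have already diverged.
diverged = {"image1": ["A", "B", "C"], "image2": ["A", "D", "C"]}

# Same commands per image, with the shared ones moved to the front.
reordered = {"image1": ["A", "C", "B"], "image2": ["A", "C", "D"]}

for label, images in [("diverged", diverged), ("reordered", reordered)]:
    tree = build_tree(images)
    print(label, "->", count_layers(tree), "layers")
    print("\n".join(render(tree)))
```

In the diverged case `C` shows up twice in the tree, once under `B` and once under `D`, so it gets built twice. Moving it ahead of the split saves a layer.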

### Problem statement

Let's write this thing in LaTeX!

#### What are we trying to do again?

Given a set of Docker images to construct, each of which requires a series of commands, find the optimal order to run these commands to maximize use of the Docker cache.

_Could also read: to minimize required computing resources._
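To make the problem concrete, here's a brute-force sketch for toy instances. It leans on some loud assumptions that are mine, not part of the problem statement: every command costs the same, commands within an image can run in any order, and the "cost" is just the number of distinct layers built. The image names and the commands `A` through `E` are hypothetical. Exhaustive search blows up factorially, so this is only meant to pin down what "optimal" means here.

```python
from itertools import permutations, product


def count_layers(orderings):
    """Count distinct command prefixes across all chosen orderings."""
    prefixes = set()
    for commands in orderings:
        for i in range(1, len(commands) + 1):
            prefixes.add(commands[:i])
    return len(prefixes)


def best_orderings(images):
    """Try every ordering of every image; keep the one with fewest layers.

    Assumes commands are interchangeable in order, which real Dockerfile
    commands often are not.
    """
    names = sorted(images)
    candidates = [list(permutations(sorted(images[name]))) for name in names]
    best = min(product(*candidates), key=count_layers)
    return dict(zip(names, best)), count_layers(best)


# Hypothetical images, each described by the set of commands it needs.
images = {
    "prod": {"A", "B", "C"},
    "test": {"A", "B", "C", "D"},
    "integration": {"A", "B", "E"},
}

orderings, layers = best_orderings(images)
for name, commands in orderings.items():
    print(name, " ".join(commands))
print("layers built:", layers)
```

With five distinct commands there have to be at least five layers, and the search finds an ordering that hits that bound by putting the commands shared by the most images first.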

### Partitioning sets with PuLP

We'll start with a really easy NP-Complete problem...

### Finding maximal cliques with NetworkX

...add another NP-Complete subproblem...

### Model construction

...and then we'll solve both of them.

### Results

No, really. That's how we do this thing.