# Copy This Notebook!

More than one person editing a single notebook is likely to cause conflicts. We encourage experimenting and playing along, so if you'd like to edit and run this notebook, please make a copy:

1. From the __File__ menu, select _Make a copy..._
2. The new copy will open.
3. From the __File__ menu, select _Rename..._
4. Provide a new name for your notebook, something that will distinguish it from other copies.
5. Start using your notebook!

# Containerizing Research Analyses

So far this morning we have discussed containerization with regard to application deployment and management. Another relevant use case is research: using containers to encapsulate the dependencies and workflows of analyses. The focus of this tutorial will be setting up and executing a simple analysis using RStudio and containerizing the result.

__NOTE:__ Using containers as a method for sharing reproducible research is not necessarily a best practice. As with many things, there are good reasons for and against and the method has its advocates and detractors. That said, we feel there is some utility in exploring data sharing and reproducibility as a container use case. The resources below are provided for anyone interested in a closer look at the practice and corresponding issues:

* Boettiger, Carl. "An introduction to Docker for reproducible research." _ACM SIGOPS Operating Systems Review_ 49.1 (2015): 71-79.
* Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2018). _The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences_. Oakland, CA: University of California Press.
* Hunger, Tom (2015). "Why Docker is not the answer to reproducible research, and why Nix may be." We are Wizards, blog post. https://blog.wearewizards.io/
* Brown, Titus (2012). "Virtual machines considered harmful for reproducibility." Living in an Ivory Basement, blog post. http://ivory.idyll.org/blog/vms-considered-harmful.html

## Create the Container

Rather than complete an analysis and containerize the result, we're going to start with a container that includes our data and the analysis platform (RStudio). The developers at _Rocker_ [https://github.com/rocker-org](https://github.com/rocker-org) have created a Docker image which includes R and, optionally, RStudio. This simplifies things tremendously.

For our data, we are going to return to the national farmer's market data provided by the USDA:

>US Department of Agriculture. (2018) _Farmers Markets Directory and Geographic Data_. USDA-4383. [https://catalog.data.gov/dataset/farmers-markets-geographic-data](https://catalog.data.gov/dataset/farmers-markets-geographic-data)

First let's step through the process of running a container. Because of differences across platforms regarding permissions required to install and run Docker, we have set up an online Docker service for today's demonstration. Access info will be provided during the workshop.

### Access Docker

1. Open a terminal (Windows CMD or PowerShell, or the terminal app on Mac and Linux).
1. SSH into the Docker service using the credentials provided during the workshop.

### Running Containers

Within Docker, we run containers from images. An image is built either from a Dockerfile or, in many cases, by committing the state of an existing container. We can learn about the existing images and containers using a few commands.

```
# Get the version of Docker we are running

docker -v
Docker version 20.10.7, build f0df350

# Get a list of local images
# Note that many images are downloaded the first time we run them.

docker images
REPOSITORY       TAG       IMAGE ID       CREATED        SIZE
rocker/rstudio   latest    92a1e0b4f246   3 days ago     1.72GB
rocker/r-base    latest    c5993496410e   2 weeks ago    825MB
hello-world      latest    feb5d9fea6a5   6 months ago   13.3kB
```

Note that the ```rocker/rstudio``` image we need to run our RStudio container has already been added to the service.

```
# Get a list of local containers.
# Note the '-a' option is needed to include containers that aren't running.
# Also note the output may wrap onto additional lines.
docker ps
CONTAINER ID   IMAGE            COMMAND   CREATED      STATUS      PORTS                                       NAMES
aa3cae08be89   rocker/rstudio   "/init"   2 days ago   Up 2 days   0.0.0.0:8788->8787/tcp, :::8788->8787/tcp   relaxed_brattain
bdcdd6ceaf2e   rocker/rstudio   "/init"   2 days ago   Up 2 days   0.0.0.0:8787->8787/tcp, :::8787->8787/tcp   wizardly_knuth

docker ps -a
CONTAINER ID   IMAGE            COMMAND    CREATED      STATUS                  PORTS                                       NAMES
aa3cae08be89   rocker/rstudio   "/init"    2 days ago   Up 2 days               0.0.0.0:8788->8787/tcp, :::8788->8787/tcp   relaxed_brattain
bdcdd6ceaf2e   rocker/rstudio   "/init"    2 days ago   Up 2 days               0.0.0.0:8787->8787/tcp, :::8787->8787/tcp   wizardly_knuth
6fc0585f89c0   hello-world      "/hello"   2 days ago   Exited (0) 2 days ago                                               objective_bohr
```
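Each entry in the PORTS column reads ```host address:host port->container port/protocol```. As a minimal sketch, the plain shell snippet below (no Docker required) splits one entry copied from the listing above into its parts. The variable names are only illustrative.

```shell
# One entry copied from the PORTS column of the `docker ps` output above.
mapping="0.0.0.0:8788->8787/tcp"

# Split on the "->" arrow: left is the host side, right is the container side.
host_side="${mapping%%->*}"
container_side="${mapping##*->}"

# Keep the text after the last ":" (host port) and before the "/" (container port).
host_port="${host_side##*:}"
container_port="${container_side%%/*}"

echo "Host port ${host_port} forwards to container port ${container_port}"
```

In other words, a browser pointed at port 8788 of the Docker host reaches the RStudio server listening on port 8787 inside the ```relaxed_brattain``` container. The second entry on each line (```:::8788->8787/tcp```) is the same mapping for IPv6.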

The output indicates that we already have two RStudio containers running. We will add more in a bit. First we can run a *hello world* type container using the following command:

```
docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
```

If we run the ```docker ps -a``` command again, we see that rather than running the previously created ```hello-world``` container, a new one was created:

```
docker ps -a
CONTAINER ID   IMAGE            COMMAND    CREATED              STATUS                          PORTS                                       NAMES
ebeb77782b36   hello-world      "/hello"   About a minute ago   Exited (0) About a minute ago                                               elated_booth
aa3cae08be89   rocker/rstudio   "/init"    2 days ago           Up 2 days                       0.0.0.0:8788->8787/tcp, :::8788->8787/tcp   relaxed_brattain
bdcdd6ceaf2e   rocker/rstudio   "/init"    2 days ago           Up 2 days                       0.0.0.0:8787->8787/tcp, :::8787->8787/tcp   wizardly_knuth
6fc0585f89c0   hello-world      "/hello"   2 days ago           Exited (0) 2 days ago                                                       objective_bohr
```
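This happens because ```docker run``` always creates a new container from the image. As a sketch of two alternatives, the helper functions below wrap two standard Docker commands. The function names ```rerun``` and ```run_once``` are hypothetical conveniences, not part of Docker.

```shell
# Start an existing (stopped) container again and attach to its output,
# instead of creating a new container with `docker run`.
# Argument: container name or ID, e.g. objective_bohr.
rerun() {
    docker start --attach "$1"
}

# Run an image in a throwaway container: `--rm` tells Docker to delete
# the container as soon as it exits, so it never shows up in `docker ps -a`.
# Arguments: image name, optionally followed by a command.
run_once() {
    docker run --rm "$@"
}
```

For example, ```rerun objective_bohr``` would print the hello-world message again without adding a row to the ```docker ps -a``` listing, and ```run_once hello-world``` would leave no stopped container behind.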

Let's go ahead and run a container for which we don't already have an available image.

```
docker run -it ubuntu bash

Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
4d32b49e2995: Already exists
Digest: sha256:bea6d19168bbfd6af8d77c2cc3c572114eb5d113e6f422573c93cb605a0e2ffb
Status: Downloaded newer image for ubuntu:latest
root@1be4f19489f3:/#
```

Notice that the prompt changes. This is because we are now using the bash shell inside an Ubuntu container: the image was pulled and the container created when we ran the command ```docker run -it ubuntu bash```. The default Ubuntu image does not come pre-loaded with many utilities, but we can install them using the ```apt-get``` command.

```
# First run an update.
apt-get update

# Then install python
apt-get install python

# Launch a python interpreter in the shell.
python
Python 2.7.18 (default, Mar  8 2021, 13:02:45)
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print "Hello world!"
Hello world!
>>> quit()

# Exit the Ubuntu container and return to the Docker service.
exit
```

For our current example, that's it!

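The image can now be built using the following command, run from the directory that contains the Dockerfile (the trailing ```.``` sets the build context to the current directory). Provide a namespace and image name where indicated; I typically use my NetID as the namespace. The build may take some time, especially as we are adding some R packages: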

```
docker build -t [namespace]/[image name] .

docker build -t jwheel01/rdemo .
```
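The build command reads its instructions from a file named ```Dockerfile``` in the build directory. The snippet below is a minimal sketch of what such a file might look like for this demo, not the exact file used in the workshop. The directory name ```rdemo```, the data file name ```farmers_markets.csv```, and the package list (inferred from the functions used later in the analysis) are all assumptions.

```shell
# Create a build directory and write a hypothetical Dockerfile into it.
# The quoted 'EOF' keeps the shell from expanding anything inside the file.
mkdir -p rdemo
cat > rdemo/Dockerfile <<'EOF'
# Start from the Rocker image that bundles R and RStudio Server.
FROM rocker/rstudio

# Install the R packages the analysis relies on (assumed list).
RUN Rscript -e 'install.packages(c("readr", "dplyr", "ggplot2", "ggrepel"), repos = "https://cloud.r-project.org")'

# Copy the dataset (assumed file name) into the rstudio user's home directory.
COPY farmers_markets.csv /home/rstudio/farmers_markets.csv
EOF
```

With the CSV file placed next to the Dockerfile, the ```docker build``` command above would be run from inside the ```rdemo``` directory.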

Once the build has finished, we can run the container in the background (```-d```), publishing container port 8787 on the same port of the host (```-p 8787:8787```):

```
docker run -d -p 8787:8787 --name rdemo jwheel01/rdemo
```

As always, I like to check things before diving in. A useful command for making sure everything is running properly is

```
docker ps
```

or

```
docker ps -a
```

The output of ```docker ps``` should look something like:

```
CONTAINER ID   IMAGE               COMMAND     CREATED             STATUS              PORTS                    NAMES
12c8ccc324d1   jwheel01/rdemo      "/init"     6 seconds ago       Up 4 seconds        0.0.0.0:8787->8787/tcp   rdemo
```

## Accessing RStudio

Barring errors, we can now access RStudio with the preloaded dataset at [localhost:8787](http://localhost:8787).

The default is 'rstudio' for both username and password. Note that this can be changed as needed for security reasons.
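As a sketch of how the password might be changed, the Rocker documentation describes a ```PASSWORD``` environment variable that the ```rocker/rstudio``` image reads at startup. The helper below passes it with ```docker run -e```. The function name ```run_rstudio``` and its argument order are hypothetical, chosen only for this example.

```shell
# Start an RStudio container with a non-default password.
# Arguments: host port, password, container name, image.
# PASSWORD is read by the rocker/rstudio image (per the Rocker docs);
# the function itself is a hypothetical convenience wrapper.
run_rstudio() {
    docker run -d \
        -p "$1:8787" \
        -e PASSWORD="$2" \
        --name "$3" \
        "$4"
}
```

For example, ```run_rstudio 8789 'choose-a-password' rdemo_private jwheel01/rdemo``` would serve RStudio on host port 8789 with the given password. Keep in mind that a password typed on the command line ends up in the shell history.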

![rstudio1](./images/scr1.png)

First, create a project, then load the packages we need.

![rstudio2](./images/scr2.png)

Import the data from CSV using the 'readr' library.

![rstudio3](./images/scr3.png)

### Comparison of Farmer's Markets in the Four Corners area

This is a superficial comparison by count of farmer's markets. Once the dependencies and data have been imported, we can subset the data by state and add a summary count:

```
library(dplyr)

# Keep only the Four Corners states, then count the markets in each.
four_corners <- farmers_markets %>%
  filter(State %in% c("New Mexico", "Arizona", "Utah", "Colorado")) %>%
  group_by(State) %>%
  summarise(market_count = n())
```
![rstudio4](./images/scr4.png)

The following commands can be used to generate the plot below:

```
g <- ggplot(four_corners, aes(State))
g + geom_bar(aes(weight = market_count, fill = market_count)) +
  ggtitle("Four Corners Count of Farmer's Markets by State") +
  xlab("State") + ylab("Count of Farmer's Markets")
```

![four corners farmers markets by count](./images/4c_compare_count.png)

### Party planning in New Mexico

For a second, less superficial but still pretty superficial analysis, let's see where in NM we'd have the best options for hosting a party with farmer's market goods. 

First, let's go back and subset our data and then focus on the availability of different types of goods in NM.

```
nm <- farmers_markets %>% filter(State == "New Mexico")
```

Among the 59 variables, let's say we want to isolate some and combine others. We can create a summary table by county like so:

```
party_planning <- nm %>%
  group_by(County) %>%
  summarise(meat = sum((Meat == "Y") + (Poultry == "Y") + (Seafood == "Y")),
            beverages = sum((Wine == "Y") + (Juices == "Y") + (Coffee == "Y")),
            organics = sum(Organic == "Y"),
            tofu = sum(Tofu == "Y"))
```

![rstudio5](./images/scr5.png)

The following bit of extra code produces the image below:

```
party <- ggplot(party_planning, aes(x = organics, y = beverages))

party + geom_point(aes(col = County), shape = 1, size = 2.5, stroke = 1.5) +
  xlim(1, 5) + ylim(1, 5) + geom_text_repel(aes(label = County), force = 10) +
  ggtitle("Party Planning by County in NM") + xlab("Organic Food") + ylab("Beverage Selection")
```

![nm goods comparison](./images/nm_goods.png)

## Wrapping it up

We've published the article and contacted the state legislature to request more funding for farmer's markets in New Mexico, but we need to be able to share our analysis. We could do this many ways, including

* Publishing our Docker image
* Publishing our container

At this point, the image has the data and RStudio, but only the running container has the analysis and outputs. It is possible, however, to create a new image from the container that also captures the additional installed libraries, analysis files, and outputs. Let's do that:

```
docker images

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
jwheel01/rdemo      latest              67f5f05d1979        6 hours ago         1.09GB
rocker/rstudio      latest              b55aa8e98787        3 days ago          1.09GB
ubuntu              14.04               d6ed29ffda6b        3 months ago        221MB
ubuntu              15.10               9b9cb95443b5        19 months ago       137MB

docker commit -m "completed demo" rdemo jwheel01/rdemo:completed

docker images

REPOSITORY           TAG                 IMAGE ID            CREATED             SIZE
jwheel01/rdemo       completed           d0ffdbbd8dcb        9 seconds ago       1.31GB
jwheel01/rdemo       latest              67f5f05d1979        6 hours ago         1.09GB
rocker/rstudio       latest              b55aa8e98787        3 days ago          1.09GB
ubuntu               14.04               d6ed29ffda6b        3 months ago        221MB
ubuntu               15.10               9b9cb95443b5        19 months ago       137MB
```

The image as modified can be pushed to a Docker repository such as Docker Hub (be sure to use the correct tag!). Another option is to use the ```docker save``` command to save the image to a tar file which can be imported into another system using the ```docker load``` command.

```
docker save --output rdemo.tar jwheel01/rdemo:completed

docker load --input rdemo.tar
```
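When the tar file is passed along by hand, the recipient may want to confirm it arrived intact before loading it. The sketch below records and verifies a SHA-256 checksum. It assumes the GNU coreutils ```sha256sum``` command is available (typical on Linux), and the function names are hypothetical.

```shell
# Record a SHA-256 checksum next to an archive such as rdemo.tar.
# Assumes GNU coreutils `sha256sum` is installed.
write_checksum() {
    sha256sum "$1" > "$1.sha256"
}

# Verify an archive against the checksum file written above.
# Returns non-zero if the archive has changed or is incomplete.
verify_checksum() {
    sha256sum --check --quiet "$1.sha256"
}
```

The sender would run ```write_checksum rdemo.tar``` after ```docker save``` and share both files. The recipient would run ```verify_checksum rdemo.tar``` from the directory holding them before ```docker load```.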

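For now, we can verify that the newly tagged image contains our analysis. The original ```rdemo``` container still holds port 8787 on the host, so we stop it before starting a container from the new image:

```
docker stop rdemo

docker run -d -p 8787:8787 --name shared_demo jwheel01/rdemo:completed
```

Then go to [localhost:8787](http://localhost:8787).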

![rstudio6](./images/scr6.png)