# Copy This Notebook!

More than one person editing a single notebook is likely to cause conflicts. We encourage experimenting and playing along, so if you'd like to edit and run this notebook, please make a copy:

1. From the __File__ menu, select _Make a copy..._
2. The new copy will open.
3. From the __File__ menu, select _Rename..._
4. Provide a new name for your notebook, something that will distinguish it from other copies.
5. Start using your notebook!

# Containerizing Research Analyses

So far this morning we have discussed containerization with regard to application deployment and management. Another relevant use case is research: using containers to encapsulate the dependencies and workflows of analyses. This tutorial focuses on setting up and executing a simple analysis in RStudio and containerizing the result.

__NOTE:__ Using containers as a method for sharing reproducible research is not necessarily a best practice. As with many things, there are good reasons for and against and the method has its advocates and detractors. That said, we feel there is some utility in exploring data sharing and reproducibility as a container use case. The resources below are provided for anyone interested in a closer look at the practice and corresponding issues:

* Boettiger, C. (2015). "An introduction to Docker for reproducible research." _ACM SIGOPS Operating Systems Review_, 49(1), 71-79.
* Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2018). _The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences_. Oakland, CA: University of California Press.
* Hunger, T. (2015). "Why Docker is not the answer to reproducible research, and why Nix may be." _We are Wizards_, blog post. https://blog.wearewizards.io/
* Brown, T. (2012). "Virtual machines considered harmful for reproducibility." _Living in an Ivory Basement_, blog post. http://ivory.idyll.org/blog/vms-considered-harmful.html

## Create the Container

Rather than complete an analysis and then containerize the result, we're going to start with a container that includes our data and the analysis platform (RStudio). The developers at _Rocker_ [https://github.com/rocker-org](https://github.com/rocker-org) have created a Docker image that includes R and, optionally, RStudio. This simplifies things tremendously.

For our data, we are going to return to the national farmer's market data provided by the USDA:

>US Department of Agriculture. (2018) _Farmers Markets Directory and Geographic Data_. USDA-4383. [https://catalog.data.gov/dataset/farmers-markets-geographic-data](https://catalog.data.gov/dataset/farmers-markets-geographic-data)

The data are available in CSV format in the _data_ directory, so in addition to pulling an RStudio Docker image from the Rocker repository, our Dockerfile will also copy the data into the container and update permissions as needed. Note that it is also possible, and in some cases preferable, to pull data from a URL or another online resource when the image is built.

The Dockerfile is already provided. It pulls the Rocker image, copies the data into the container, and updates permissions.
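As a rough guide, a Dockerfile matching the description above could look like the sketch below. It is not necessarily identical to the provided file: the destination path ```/home/rstudio/data``` and the scratch filename ```Dockerfile.sketch``` are assumptions made for illustration, and writing to a scratch file avoids overwriting the real Dockerfile.

```shell
# Write an illustrative Dockerfile to a scratch file (assumed name)
# so that the Dockerfile provided with the tutorial is left untouched.
cat > Dockerfile.sketch <<'EOF'
# Start from the Rocker image that bundles R and RStudio Server
FROM rocker/rstudio

# Copy the local data directory into the rstudio user's home directory
# (the destination path is an assumption)
COPY data /home/rstudio/data

# Give the rstudio user ownership so the data can be read and edited in RStudio
RUN chown -R rstudio:rstudio /home/rstudio/data
EOF

# Show the result
cat Dockerfile.sketch
```

To build from the sketch rather than the provided file, pass ```-f Dockerfile.sketch``` to ```docker build```.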

For our current example, that's it!

The image can now be built using the following command, providing a namespace and image name where indicated. I typically use my NetID as the namespace:

```
# Template
docker build -t [namespace]/[image name] .

# Example
docker build -t jwheel01/rdemo .
```

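The image build may take a few minutes. Once finished, we can run the container in the background (```-d```), publish RStudio's port 8787 to the host (```-p 8787:8787```), and give the container a name (```--name rdemo```):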

```
docker run -d -p 8787:8787 --name rdemo jwheel01/rdemo
```

As always, I like to check things before diving in. A useful command for making sure everything is running properly is

```
docker ps
```

or, to also list stopped containers,

```
docker ps -a
```

The output of ```docker ps``` should look something like:

```
CONTAINER ID   IMAGE               COMMAND     CREATED             STATUS              PORTS                    NAMES
12c8ccc324d1   jwheel01/rdemo      "/init"     6 seconds ago       Up 4 seconds        0.0.0.0:8787->8787/tcp   rdemo
```
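When several containers are running, the listing can be narrowed to the one of interest, and the container's logs are a good first place to look if RStudio does not come up. The following is a small sketch: the function name ```check_rdemo``` is hypothetical, while ```--filter```, ```--format```, and ```docker logs``` are standard Docker CLI features.

```shell
# Hypothetical helper: show the status and recent logs of the rdemo container.
check_rdemo() {
    # List only containers whose name matches "rdemo", including stopped ones
    docker ps -a --filter "name=rdemo" --format "{{.Names}}: {{.Status}} ({{.Ports}})"
    # Print the last 20 lines of the container's log output
    docker logs --tail 20 rdemo
}
```

Running ```check_rdemo``` after ```docker run``` gives a quick one-line status followed by any startup messages.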

## Accessing RStudio

Barring errors, we can now access RStudio with the preloaded dataset at [http://localhost:8787](http://localhost:8787).

The default is 'rstudio' for both username and password. Note that this can be changed as needed for security reasons.
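One way to change the password, documented by the Rocker project, is to set the ```PASSWORD``` environment variable when the container starts. The helper below is a minimal sketch: the function name ```start_rstudio``` is hypothetical, and it assumes the image and container names used earlier in this tutorial.

```shell
# Hypothetical helper: start the RStudio container with a chosen password.
# Usage: start_rstudio 'a-long-passphrase'
start_rstudio() {
    # Refuse to start without a password argument
    if [ -z "$1" ]; then
        echo "usage: start_rstudio PASSWORD" >&2
        return 1
    fi
    # -e PASSWORD=... sets the login password for the 'rstudio' user
    docker run -d -p 8787:8787 -e PASSWORD="$1" --name rdemo jwheel01/rdemo
}
```

If a container named ```rdemo``` already exists from the previous step, remove it first with ```docker rm -f rdemo```, then call ```start_rstudio``` and log in as 'rstudio' with the new password.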

![rstudio1](./images/scr1.png)

First, create a project, then install and load some needed packages: 'dplyr', 'ggplot2', and 'ggrepel'. This may take some time.

![rstudio2](./images/scr2.png)

Import the data from CSV using the 'readr' library.

![rstudio3](./images/scr3.png)

### Comparison of Farmer's Markets in the Four Corners area

This is a superficial comparison by count of farmer's markets. Once the dependencies and data have been imported, we can subset the data by state and add a summary count:

```
# Keep only the Four Corners states, then count markets per state
four_corners <- farmers_markets %>%
                   filter(State %in% c("New Mexico", "Arizona", "Utah", "Colorado")) %>%
                   group_by(State) %>%
                   summarise(market_count = n())
```
![rstudio4](./images/scr4.png)

The following commands can be used to generate the plot below:

```
# Bar height and fill color both encode the number of markets
g <- ggplot(four_corners, aes(State))
g + geom_bar(aes(weight = market_count, fill = market_count)) +
    ggtitle("Four Corners Count of Farmer's Markets by State") +
    xlab("State") +
    ylab("Count of Farmer's Markets")
```

![four corners farmers markets by count](./images/4c_compare_count.png)

### Party planning in New Mexico

For a second, less superficial but still pretty superficial analysis, let's see where in NM we'd have the best options for hosting a party with farmer's market goods. 

First, let's go back and subset our data and then focus on the availability of different types of goods in NM.

```
nm <- farmers_markets %>% filter(State == "New Mexico")
```

Among the 59 variables, let's say we want to isolate some and combine others. We can create a summary table by county like so:

```
# Per county: count "Y" flags across related columns to score each category
party_planning <- nm %>%
    group_by(County) %>%
    summarise(meat = sum((Meat == "Y") + (Poultry == "Y") + (Seafood == "Y")),
              beverages = sum((Wine == "Y") + (Juices == "Y") + (Coffee == "Y")),
              organics = sum(Organic == "Y"),
              tofu = sum(Tofu == "Y"))
```

![rstudio5](./images/scr5.png)

The following bit of extra code produces the image below:

```
# Plot each county by its organic and beverage scores
party <- ggplot(party_planning, aes(x = organics, y = beverages))

party + geom_point(aes(col = County), shape = 1, size = 2.5, stroke = 1.5) +
    xlim(1, 5) + ylim(1, 5) +
    geom_text_repel(aes(label = County), force = 10) +
    ggtitle("Party Planning by County in NM") +
    xlab("Organic Food") + ylab("Beverage Selection")
```

![nm goods comparison](./images/nm_goods.png)

## Wrapping it up

We've published the article and contacted the state legislature to request more funding for farmer's markets in New Mexico, but we still need to be able to share our analysis. We could do this in several ways, including:

* Publishing our Docker image
* Publishing our container

At this point, the image has the data and RStudio, but only the container has the analysis and outputs. It is possible to update the image with the additional installed libraries and other changes, for example with ```docker commit```. For today we will save both to tar files:

```
docker images

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
jwheel01/rdemo      latest              67f5f05d1979        3 hours ago         1.09GB
rocker/rstudio      latest              b55aa8e98787        3 days ago          1.09GB

docker save --output rdemo_image.tar 67f5f05d1979

docker ps -a
CONTAINER ID    IMAGE               COMMAND         CREATED             STATUS                     PORTS         NAMES
12c8ccc324d1    jwheel01/rdemo      "/init"         3 hours ago         Exited (0) 5 seconds ago                 rdemo

docker container export --output rdemo_completed.tar rdemo
```

Images saved this way can be restored using the ```docker load``` command.

Exported containers can be turned back into an image with ```docker import``` and then started with ```docker run```.
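A minimal sketch of both restore paths follows. Two details are worth knowing: an image saved by ID, as above, loads without a repository name, so it is re-tagged here; and ```docker import``` rebuilds an image from the container's filesystem only, so the start command and exposed port are restated with ```--change```. The function names and the ```completed``` tag are hypothetical, and the sketch assumes the image ID is unchanged after loading (check with ```docker images```). Environment variables defined by the original image are not carried over by an export either and may need to be restored with additional ```--change 'ENV ...'``` options.

```shell
# Hypothetical helper: restore the image written by `docker save`.
restore_image() {
    docker load --input rdemo_image.tar
    # The archive was saved by image ID, so the name has to be re-applied
    docker tag 67f5f05d1979 jwheel01/rdemo:latest
}

# Hypothetical helper: rebuild and run an image from the exported container.
restore_container() {
    # An export holds only the filesystem; restate the command and port
    docker import --change 'CMD ["/init"]' --change 'EXPOSE 8787' \
        rdemo_completed.tar jwheel01/rdemo:completed
    # Use a new container name to avoid clashing with the original rdemo
    docker run -d -p 8787:8787 --name rdemo_completed jwheel01/rdemo:completed
}
```

Call ```restore_image``` or ```restore_container``` from the directory holding the tar files. If the second succeeds, RStudio should be reachable on port 8787 again with the analysis and outputs in place.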