# Containerization for Reproducible Bioinformatics Research


### Lessons from the NCI Cloud Resources and Hackathons

Steve Tsang, NCI-CBIIT<br>
Prepared for NLM Reproducibility Workshop<br>
9/4/2018 11-12PM<br>

## Disclaimer

The opinions/comments/assessment expressed in this presentation are the author's own and do not necessarily reflect the view of the National Cancer Institute or National Institutes of Health.
<br><br>https://ethics.od.nih.gov/topics/Disclaimer.htm


## Reproducibility and Containerization

![](screenshots/spectrum.png)

## "Reproducibility software containers" on PubMed

In [1]:
%matplotlib widget
#%matplotlib notebook

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

plt.rcdefaults()  # reset matplotlib to its default style

# Load the exported PubMed search results; the first row holds the column names.
df = pd.read_csv('pubmed_result.csv', header=0)

# Parse publication dates; unparseable values become NaT and drop out of the counts.
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Count publications per year for 1989-2018 and draw them as a bar chart.
(df.loc[df['Date'].dt.year.between(1989, 2018), 'Date']
   .groupby(df['Date'].dt.year)
   .count()
   .plot(kind="bar")
)

plt.ylabel("Number of Publications")
plt.xlabel("Year")
plt.title("Pubmed Search Results")
plt.show()




<img src="screenshots/pubmedplot.png" width="800">


![](screenshots/timeline.png)
<center>Image Produced on Seven Bridges Cancer Genomics Cloud<br></center>
<center>Workflow inspired by https://github.com/wilke/CWL-Quick-Start</center>

## Reproducibility is Challenging

![](screenshots/learnhackathon.png)

## Cancer Genomic Data Challenges

![](screenshots/challenges.png)

## NCI Cloud Resources Concept

![](screenshots/Concept.png)

## NCI Cloud Resources

![](screenshots/CR.png)

![](screenshots/CRDC.png)


## F.A.I.R Guiding Principles

![](screenshots/fair.png)

## "Reproducibility Crisis"

<center><img src="screenshots/fairanalysis.png" width="800"></center>

## Containerization Technology 

<img src="screenshots/containercloud.png" width="1200">

## Docker Concept

<center><img src="screenshots/docker.png" width="800"></center>

## Docker's Layered Filesystem

In [3]:
!more Dockerfile  

FROM ubuntu:16.04
RUN apt-get update
RUN apt-get install -y python3


```{}
docker build -t sampleimage .

Step 1/3 : FROM ubuntu:16.04
 ---> 52b10959e8aa
Step 2/3 : RUN apt-get update
...
 ---> 3502c5bdd18a
Step 3/3 : RUN apt-get install -y python3
...
 ---> e0aac1c590b7
Successfully built e0aac1c590b7
Successfully tagged sampleimage:latest
```
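The build log above shows one intermediate image ID per Dockerfile instruction. As a rough mental model only (this is not Docker's actual implementation), each layer can be thought of as identified by a hash of its parent's ID plus the instruction that produced it. That is why editing one instruction invalidates the cache for that step and every later one, while earlier steps are reused. The function names and the hashing scheme below are illustrative assumptions.

```python
import hashlib
from typing import List


def layer_id(parent_id: str, instruction: str) -> str:
    """Toy layer ID: hash of the parent ID plus the instruction text.

    Simplified model for illustration; Docker's real IDs are derived
    from layer contents and image configuration.
    """
    digest = hashlib.sha256(f"{parent_id}\n{instruction}".encode("utf-8"))
    return digest.hexdigest()[:12]


def build(instructions: List[str]) -> List[str]:
    """Return one toy layer ID per instruction, each chained to the previous."""
    ids = []
    parent = ""
    for instruction in instructions:
        parent = layer_id(parent, instruction)
        ids.append(parent)
    return ids


dockerfile = [
    "FROM ubuntu:16.04",
    "RUN apt-get update",
    "RUN apt-get install -y python3",
]
# Hypothetical edit: only the last instruction changes.
edited = dockerfile[:2] + ["RUN apt-get install -y python3 python3-pip"]

original_ids = build(dockerfile)
edited_ids = build(edited)

# The first two layers are shared (cache hits); only the last one differs.
shared = [a == b for a, b in zip(original_ids, edited_ids)]
print(shared)  # [True, True, False]
```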

## Dockerfile

![](screenshots/dockerfile.png)

```()
docker build -t stevetsa/kallisto:latest .
docker push stevetsa/kallisto:latest

# push to GitHub and auto-build on Docker Hub
docker pull stevetsa/kallisto:latest
docker run -v `pwd`:`pwd` -w `pwd` -i -t stevetsa/kallisto
```
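The `docker run` line above mounts the current directory into the container at the same path and makes it the working directory, so tools inside the container read and write files in place. A minimal sketch of assembling the same command from Python follows. `docker_run_command` is a hypothetical helper name, `/data/project` is a made-up path, and the sketch only builds the argument list without running anything.

```python
import os
import shlex
from typing import List, Optional


def docker_run_command(image: str, workdir: Optional[str] = None) -> List[str]:
    """Build `docker run` arguments that mount `workdir` at the same path
    inside the container and use it as the working directory."""
    workdir = os.path.abspath(workdir or os.getcwd())
    return [
        "docker", "run",
        "-v", f"{workdir}:{workdir}",  # bind-mount the host dir at an identical path
        "-w", workdir,                 # start in that directory
        "-i", "-t",                    # interactive terminal
        image,
    ]


# Hypothetical project directory, used here only for illustration.
cmd = docker_run_command("stevetsa/kallisto", workdir="/data/project")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would launch the container, assuming Docker is installed and the image can be pulled.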

## Sharing Docker-based Tools

<img src="screenshots/dockstore.png" width="1200">

## Use Case 1

<img src="screenshots/usecase1.png">

<img src="screenshots/nastybugsworkflow.png">

<img src="screenshots/nastybugsreviewer.png">

<img src="screenshots/nastybugsgithub.png">

<img src="screenshots/nastybugsdockerhub.png">

## Use Case 2

<img src="screenshots/usecase2.png">

August 2017 NCBI Hackathon - RNAseq viewer<br><br>
The goal of this project is to leverage web technologies to build a modular gene expression viewer for large-scale, complex experiments. The data included in this repo is just a sample of what can be achieved with this scheme by using Django and Polymer for optimal performance, ease of use, and consistency.





<img src="screenshots/viewer1.png" width="800">

<img src="screenshots/viewer2.png" width="800">

```{}
docker run -itp 8000:8000 stevetsa/gea-image
## in browser
http://127.0.0.1:8000/genvis/ideogram
```
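The container may need a few seconds before the viewer answers on the published port. The sketch below polls a TCP port until it accepts a connection, which is one simple way to check that the service is up before opening the browser. `wait_for_port` is a hypothetical helper, and the timeout values are arbitrary assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, or False
    if `timeout` seconds pass without a successful connection."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # not ready yet; retry shortly
    return False
```

For example, calling `wait_for_port("127.0.0.1", 8000)` after starting the container returns `True` as soon as port 8000 accepts connections. It does not verify that the `/genvis/ideogram` page itself renders.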

## Use Case 3

<img src="screenshots/usecase3.png">


![](screenshots/UCSD1.png)

<img src="screenshots/UCSD2.png" width="800">

In [8]:
import py3Dmol
viewer = py3Dmol.view(query='pdb:1OHR')
viewer.setStyle({'chain': 'A'}, {'cartoon': {'color': 'blue'}})
viewer.setStyle({'chain': 'B'}, {'cartoon': {'color': 'yellow'}})
viewer.addSurface(py3Dmol.MS,{'opacity':0.9,'color':'lightblue'},{'chain': 'A', 'hetflag' : False})
viewer.addSurface(py3Dmol.MS,{'opacity':0.9,'color':'lightyellow'},{'chain': 'B', 'hetflag' : False})
viewer.show()

In [10]:
ligand = {'resn': '1UN'}
viewer.setStyle(ligand, {'stick':{'radius': 0.3, 'singleBond': False}})
viewer.addLabel('Nelfinavir', {'fontColor':'black', 'backgroundColor':'lightgray'},ligand)
viewer.zoomTo(ligand)
viewer.show()

## Discussions

## Will the same Dockerfile always produce identical images?

- Defining reproducibility
- Containerization allows you to run legacy software
    - Reproducibility vs “security”
- Containerization simplifies running software
- Containerization provides an isolated environment for testing 
- Sharing images/containers
    - Samtools in Dockerhub - defining “identical” images for tools and workflows
- Tool documentation
    - Best practices (e.g. Dockerfiles) to minimize trial and error
- Training/education
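Because `RUN apt-get update` fetches whatever package index is current at build time, two builds of the same Dockerfile on different days can contain different files. One way to make "identical" concrete is to compare content digests instead of tags. The sketch below hashes a directory tree by relative path and file contents, ignoring timestamps. It is an illustrative stand-in, not how Docker or a registry computes image digests; `tree_digest` and `write_tree` are hypothetical helpers and the file contents are made up.

```python
import hashlib
import os
import tempfile


def tree_digest(root: str) -> str:
    """SHA-256 over sorted relative paths and file contents under `root`.

    Timestamps and ownership are deliberately ignored, so only content
    differences change the digest.
    """
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            digest.update(rel.encode("utf-8") + b"\0")
            with open(path, "rb") as handle:
                digest.update(handle.read())
            digest.update(b"\0")
    return digest.hexdigest()


def write_tree(root: str, files: dict) -> None:
    """Create files (relative path -> text content) under `root`."""
    for rel, text in files.items():
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)


with tempfile.TemporaryDirectory() as a, \
        tempfile.TemporaryDirectory() as b, \
        tempfile.TemporaryDirectory() as c:
    # Stand-ins for three builds: a and b match, c has an upgraded tool.
    write_tree(a, {"usr/bin/tool": "v1.0", "etc/conf": "x=1"})
    write_tree(b, {"usr/bin/tool": "v1.0", "etc/conf": "x=1"})
    write_tree(c, {"usr/bin/tool": "v1.1", "etc/conf": "x=1"})

    same_content = tree_digest(a) == tree_digest(b)
    upgraded_differs = tree_digest(a) != tree_digest(c)

print(same_content, upgraded_differs)  # True True
```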



## NCI Containers and Workflows Interest Group
[Discussions and Meetings](https://goo.gl/gccfB7)

-  Initiate cross-NCI strategy to:
    -  facilitate scientific computing standards, guidelines & best practices
    -  share methods to promote reproducible science
    -  democratize computational research and benefit the community using these methods
-  Discuss approaches and possible technical solutions for describing scientific workflows and sharing containerized tools developed by NCI-funded programs




-  Build a community of practice and discuss relevant topics in container and workflow technologies
-  Approx. 130 members joined since Sept 2016
-  Monthly meetings - https://goo.gl/gccfB7
-  Presentations/Lectures
    -  Survey the rapidly evolving fields of container and workflow technologies and invite outside experts to inform and educate members of the Working Group
    -  Use cases from the Cloud Resources; community efforts (GA4GH challenge, Dockstore, BioContainers, CWL); and various scientific domains (genomics, microbiology, neuroscience, imaging, etc.)

## Additional Resources 

- [Bioconda](https://bioconda.github.io/)
- [Bioconductor Docker Containers](https://www.bioconductor.org/help/docker/)
- [BioContainers](https://biocontainers.pro/)
- [Bioboxes](http://bioboxes.org/)
- [NCBI Base Images](https://github.com/NCBI-Hackathons/HackathonBaseImages)

Awesome Containers - <br>
[Awesome Containers](https://github.com/tcnksm/awesome-container); [Awesome Linux Containers](https://github.com/Friz-zy/awesome-linux-containers); [Awesome Docker](https://github.com/veggiemonk/awesome-docker)

Container Registries - <br>
[Docker Hub](http://www.dockerhub.com/);  [Quay.io](https://quay.io/repository/); [Dockstore](https://dockstore.org/);  [Singularity Hub](https://www.singularity-hub.org);  [Google Container Registry](https://cloud.google.com/container-registry/); [AWS Container Registry](https://aws.amazon.com/ecr/);  [Azure Container Registry](https://azure.microsoft.com/en-us/services/container-registry/);  [Seven Bridges Image Registry](https://docs.sevenbridges.com/docs/the-image-registry);  [GitLab Container Registry](https://about.gitlab.com/2016/05/23/gitlab-container-registry/);  [DGX/Nvidia Container Registry](https://www.nvidia.com/en-us/gpu-cloud/deep-learning-containers/)



## Acknowledgement

![](screenshots/ack.png)

## About this presentation...

### Publishing on Binder

In order for a binder-hosted notebook to start in slideshow mode, you need to have the following tag set in the notebook metadata:

    ...
    "rise": {
            "autolaunch": true
            }
    ...

You can edit the notebook metadata from the `Edit` menu, submenu `Edit notebook metadata`.

Note finally that the `rise` key in this JSON file used to be named `livereveal`. The latter is still honored, but the former takes precedence, and it is recommended to use only `rise` from now on.
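The same metadata change can be scripted, since a notebook file is plain JSON with a top-level `metadata` object. A minimal sketch, assuming that layout; `enable_rise_autolaunch` is a hypothetical helper, and the dictionary below stands in for a loaded `.ipynb` file.

```python
import json


def enable_rise_autolaunch(notebook: dict) -> dict:
    """Set metadata.rise.autolaunch to true on a notebook JSON document,
    creating the intermediate objects if they are missing."""
    metadata = notebook.setdefault("metadata", {})
    rise = metadata.setdefault("rise", {})
    rise["autolaunch"] = True
    return notebook


# Minimal stand-in for the contents of a notebook file.
notebook = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 2}
enable_rise_autolaunch(notebook)
print(json.dumps(notebook["metadata"], indent=4))
```

For a real notebook, the dictionary would come from `json.load` on the `.ipynb` file and be written back with `json.dump`. Existing metadata keys are left untouched.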