# Reproducibility and exploratory computing with a Jupyter-based workflow
*Antonino Ingargiola*, EuroScipy 2018

# JupyterCon Talk: "I don't like notebooks"

In [1]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">slides for my &quot;I Don&#39;t Like Notebooks&quot; <a href="https://twitter.com/hashtag/JupyterCon?src=hash&amp;ref_src=twsrc%5Etfw">#JupyterCon</a> talk:<a href="https://t.co/30peBFwTbv">https://t.co/30peBFwTbv</a></p>&mdash; Joel Grus (@joelgrus) <a href="https://twitter.com/joelgrus/status/1033035196428378113?ref_src=twsrc%5Etfw">August 24, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In [2]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Notebook development guidelines at <a href="https://twitter.com/netflixdata?ref_src=twsrc%5Etfw">@netflixdata</a> :<br><br>* Keep a low branching factor<br>* Short and simple is better<br>* Keep to one primary outcome<br>* Leave library functions in libraries<br>* Move complexity to libraries<br><br> <a href="https://twitter.com/codeseal?ref_src=twsrc%5Etfw">@codeseal</a> at <a href="https://twitter.com/hashtag/jupytercon?src=hash&amp;ref_src=twsrc%5Etfw">#jupytercon</a></p>&mdash; Caitlin Hudon👩🏼‍💻 (@beeonaposy) <a href="https://twitter.com/beeonaposy/status/1032693394965975040?ref_src=twsrc%5Etfw">August 23, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

# An opinionated workflow ...

## 1. Prepare the environment

- Use one folder for the project
- Put the folder under automatic backup (Dropbox, ...)
- Create a separate environment (pipenv, conda, docker)
    - Save the specs (one of these):
        - requirements.txt
        - environment.yml
        - Dockerfile
    

### Example: create a conda environment

```
(base)$ cd my_project
(base)$ conda create -n my_project_env python=3.6 scikit-learn seaborn
(base)$ conda activate my_project_env
(my_project_env)$
```


### Folder structure

```
    project_1/
    
        data/
        results/
        figures/
        
        analysis.ipynb
        environment.yml
```
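
A minimal sketch of creating this skeleton from Python, using only `pathlib`. The function name `create_project` is an assumption made for illustration; the folder names are the ones shown above.

```python
from pathlib import Path


def create_project(root, subfolders=("data", "results", "figures")):
    """Create the project folder and its sub-folders; return the root path."""
    root = Path(root)
    for name in subfolders:
        # parents=True also creates the root; exist_ok=True makes re-runs safe
        (root / name).mkdir(parents=True, exist_ok=True)
    return root
```

Calling `create_project("project_1")` twice is harmless, so the call can sit at the top of a notebook.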

## 2. Prepare the Data

- Transform your data to a common format

**Example:** *wide* vs *long-form* (tidy):

![](images/tidyr-spread-gather.gif)
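
The gather and spread operations in the image have close pandas counterparts, `melt` and `pivot`. The sketch below assumes pandas is installed; the table and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical wide-form table: one row per sample, one column per experiment
wide = pd.DataFrame({
    "sample": ["a", "b"],
    "experiment1": [0.1, 0.2],
    "experiment2": [0.3, 0.4],
})

# Gather the experiment columns into (experiment, value) pairs: long form
long = wide.melt(id_vars="sample", var_name="experiment", value_name="value")

# Spread back to wide form; the round trip recovers the original values
wide_again = (
    long.pivot(index="sample", columns="experiment", values="value")
        .reset_index()
)
```

The long form has one observation per row, which is the layout that plotting libraries such as seaborn generally expect.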

## 3. Notebook narrative

- Narrative: 
    - *headings, equations, links, figures*

- Document the purpose, not the mechanics

- Print library versions
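
One possible way to print versions without importing each library is the standard `importlib.metadata` module (Python 3.8+). The helper names below are assumptions. Note that the lookup uses distribution names (for example `scikit-learn`, not `sklearn`).

```python
from importlib import metadata


def library_versions(names):
    """Map each distribution name to its installed version (or a marker)."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def print_versions(names):
    """Print one 'name: version' line per requested distribution."""
    for name, version in library_versions(names).items():
        print(f"{name}: {version}")
```

A first notebook cell could then call `print_versions(["numpy", "scipy", "scikit-learn"])`.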

## 4. Notebook (code) structure


- Use a linear structure (low branching factor)

- Short and simple is better

- Keep to one primary outcome

- Do "Restart + Run-All" often

## A starting project
*(1 notebook)*

![](images/empty_notebook.png)

## A starting project (2)
*(1 notebook)*

- Remove duplication (DRY)

- Write **functions**

- Add assertion-based **tests**
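
As an illustration of the last two points, here is a hypothetical helper (`normalize` is not from the talk) followed by the kind of assertion-based tests that can live in the next notebook cell.

```python
import math


def normalize(values):
    """Scale a sequence of non-negative numbers so that it sums to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return [v / total for v in values]


# Assertion-based tests: they run every time the notebook is executed
assert normalize([1, 1, 2]) == [0.25, 0.25, 0.5]
assert math.isclose(sum(normalize([3, 4, 5])), 1.0)
```

Because the assertions run on every "Restart + Run-All", a regression in the function stops the notebook early instead of silently changing the results.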

## Maturing Project
*(1-2 notebooks)*

- Save intermediate results (example: CSV, figures)

- Consolidate functions into modules:
    - Move functions to `.py` file
    - Move tests to `test_*.py` file, run it with `pytest`
    - Commit `.py` files

- For short projects, commit the full notebooks
- Use `nbdime` for diffs
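
Saving intermediate results can be as simple as the sketch below, which uses the standard `csv` module. The function names and the row layout (a list of dicts with identical keys) are assumptions.

```python
import csv
from pathlib import Path


def save_results(rows, path):
    """Write a list of dicts (same keys) to a CSV file, creating folders."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return path


def load_results(path):
    """Read the CSV back as a list of dicts (all values come back as strings)."""
    with Path(path).open(newline="") as f:
        return list(csv.DictReader(f))
```

A later notebook can then start from `load_results("results/results_experiment1.csv")` instead of repeating the full analysis.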

In [3]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">I had such an outdated nbdime! Now a button shows a graphical diff  between an open notebook and its latest git-committed version with one  click.    🎉  <br><br>Install or update nbdime, then follow these simple instructions to enable the extension:<a href="https://t.co/D1pwoP7lAT">https://t.co/D1pwoP7lAT</a> <a href="https://t.co/9yJnYBsMyF">pic.twitter.com/9yJnYBsMyF</a></p>&mdash; Antonino Ingargiola (@tritemio_sc) <a href="https://twitter.com/tritemio_sc/status/1034075881889849344?ref_src=twsrc%5Etfw">August 27, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>


## Maturing Project (2)
*a few notebooks*

Notebooks start to be consolidated.

- Several notebooks share the same `.py` files

- If notebooks are similar, **parametrize**

- Batch run:
    - Save notebooks with no output (**template**)
    - Save executed notebooks in sub-folder
    - Commit template notebooks
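
A notebook file is JSON, so an output-free template can be produced with the standard library alone. The sketch below assumes the nbformat v4 layout (code cells carry `outputs` and `execution_count`); the function names are hypothetical.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a notebook (nbformat v4 dict) with outputs removed."""
    clean = copy.deepcopy(notebook)
    for cell in clean.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return clean


def save_template(src_path, dst_path):
    """Read an .ipynb file and write an output-free copy to dst_path."""
    with open(src_path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(strip_outputs(notebook), f, indent=1)
        f.write("\n")
```

Committing only the stripped template keeps diffs small, while the executed copies stay in their own sub-folder.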

### Example:

    project_1/
    
        data/
            experiment1.hdf5
            experiment2.hdf5
            experiment3.hdf5
            
        results/
            results_experiment1.csv
            results_experiment2.csv
            results_experiment3.csv
            
        reports/
            analysis_experiment1.ipynb
            analysis_experiment2.ipynb
            analysis_experiment3.ipynb
            
        index.ipynb
        analysis_template.ipynb
        ...
        loader.py
        analysis.py
        plots.py

## Mature project: pipeline
*many notebooks*

- Build a **master notebook** to:

    1. Batch-run the full pipeline
        - *parametrized notebooks*
    2. Aggregate results from multiple notebooks
        - *figures, CSV, etc.*
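
The aggregation step of a master notebook could look like the following sketch. It assumes that each executed notebook saved a CSV named `results_*.csv` in a common folder; the function name and the extra `source` column are assumptions.

```python
import csv
from pathlib import Path


def aggregate_results(results_dir, pattern="results_*.csv"):
    """Concatenate rows from all matching CSV files, tagging each with its source."""
    rows = []
    for path in sorted(Path(results_dir).glob(pattern)):
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                row["source"] = path.stem  # e.g. "results_experiment1"
                rows.append(row)
    return rows
```

Keeping the source file name in each row makes it possible to trace any value in a summary figure back to the notebook that produced it.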


## Parametrize Notebooks


![](images/screenshot.png)


## Parametrize notebooks

- [nbrun](https://github.com/tritemio/nbrun):
    - by yours truly
    - a single function, `run_notebook()`
    - vendor it into your project

- [papermill](https://papermill.readthedocs.io/en/latest/)
    - from Netflix
    - graphical parametrization
    - save data into notebooks
    - retrieve data, figures from notebooks for summaries
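
A hedged sketch of a batch run with papermill, assuming a template notebook that has a cell tagged `parameters`. The parameter name `data_file`, the file layout and both helper names are hypothetical; `papermill.execute_notebook(input_path, output_path, parameters=...)` is the library call doing the work.

```python
from pathlib import Path


def build_jobs(data_dir, template="analysis_template.ipynb", report_dir="reports"):
    """One job per data file: template in, executed report out."""
    jobs = []
    for data_file in sorted(Path(data_dir).glob("*.hdf5")):
        jobs.append({
            "input_path": str(template),
            "output_path": str(Path(report_dir) / f"analysis_{data_file.stem}.ipynb"),
            # Hypothetical parameter name; it must match the template notebook
            "parameters": {"data_file": str(data_file)},
        })
    return jobs


def run_jobs(jobs):
    """Execute each job with papermill (third-party, imported only when needed)."""
    import papermill as pm

    for job in jobs:
        Path(job["output_path"]).parent.mkdir(parents=True, exist_ok=True)
        pm.execute_notebook(
            job["input_path"],
            job["output_path"],
            parameters=job["parameters"],
        )
```

Separating the job list from its execution makes it easy to inspect what will run before launching a long batch, for example with `run_jobs(build_jobs("data"))`.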

## Functions -> Libraries

When used by more than one project:

- Create a Python package
- Use `versioneer` to derive the version automatically from the git commit:

                     commit   
                    ,-----,
            0.7+12.g790526b
            /    \
      last tag   # of commits after tag '0.7'
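
To make the anatomy of such a string concrete, here is a small parser for the `TAG+DISTANCE.gHASH[.dirty]` layout shown above. It is only a sketch: the function name is an assumption, and other version styles (for example builds from an untagged repository) are deliberately rejected.

```python
import re

# TAG, optionally followed by +DISTANCE.gHASH and an optional .dirty marker
_VERSION_RE = re.compile(
    r"^(?P<tag>[^+]+)"
    r"(?:\+(?P<distance>\d+)\.g(?P<commit>[0-9a-f]+)(?P<dirty>\.dirty)?)?$"
)


def parse_version(version):
    """Split a version string into tag, commits after the tag, hash and dirty flag."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return {
        "tag": match["tag"],
        "distance": int(match["distance"] or 0),
        "commit": match["commit"],  # None for an exact tagged release
        "dirty": match["dirty"] is not None,
    }
```

For `'0.7+12.g790526b'` this gives tag `0.7`, 12 commits after the tag and commit `790526b`; a trailing `.dirty` signals uncommitted changes in the working tree.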

In [4]:
import skopt
skopt.__version__

'v0.5.2+36.g5bba7b9'

In [5]:
import numpy as np
import scipy
import skopt
print('Numpy:', np.__version__)
print('Scipy:', scipy.__version__)
print('Scikit-optimize:', skopt.__version__)

Numpy: 1.14.2
Scipy: 1.0.0
Scikit-optimize: v0.5.2+36.g5bba7b9


In [6]:
import fretbursts

 - Optimized (cython) burst search loaded.
 - Optimized (cython) photon counting loaded.
--------------------------------------------------------------
 You are running FRETBursts (version 0.7+0.g790526b.dirty).

 If you use this software please cite the following paper:

   FRETBursts: An Open Source Toolkit for Analysis of Freely-Diffusing Single-Molecule FRET
   Ingargiola et al. (2016). http://dx.doi.org/10.1371/journal.pone.0160716 

--------------------------------------------------------------


# Conclusions

- Curate "environments"
- Build narrative
- Move complexity to libraries
- Automate execution

In [7]:
%%HTML
<img src="images/Thankyou_slide.png" alt="Thank you. Antonino Ingargiola." 
height="1024" width="768"> 

# Backup

In [8]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">In light of recent discussions, here&#39;s a series of videos I made a while ago that shows my approach to reproducible data analysis in the Jupyter notebook: <a href="https://t.co/UdLAhx4jWq">https://t.co/UdLAhx4jWq</a></p>&mdash; Jake VanderPlas (@jakevdp) <a href="https://twitter.com/jakevdp/status/1034617901389496320?ref_src=twsrc%5Etfw">August 29, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>