# Reproducible science as open-source science

## The reproducibility/replicability crisis
* A cumulative science depends on the ability to produce the same result under the same conditions
* Reproducibility vs. replicability
* Scientists are generally not very good at either of these
* Why?

## Some solutions
* Better reporting standards
* Better statistical methods (confidence intervals, Bayesian inference, etc.)
* Preregistration
* Change the incentive structure
* Etc. etc.
* Most of these are specific to science, or even to particular scientific fields

## Reproducibility is arguably a solved problem
* Scientists are not unique in their need for reproducibility
* Many/most of our reproducibility problems are actually software problems
* A number of best practices that are ubiquitous among (open-source) software developers would solve many problems in scientific reproducibility

## 1. Automate everything*
* Point-and-click analysis is almost by definition irreproducible
* Log/record as much as possible
* Analysis should ideally be [completely programmatic](http://nbviewer.jupyter.org/github/WagnerLabPapers/Waskom_JNeurosci_2014/blob/master/Behavioral_and_Decoding_Analyses.ipynb)
* There are different degrees of automation, of course
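
As a rough illustration of the points above, here is a minimal sketch of a fully programmatic analysis that records its own parameters. The function name `run_analysis`, the seed value, and the fields of the record are all hypothetical choices made for this example, not part of any particular tool.

```python
import json
import platform
import random
import statistics


def run_analysis(seed, n_samples):
    """Run a toy analysis end to end and return results plus a provenance record."""
    rng = random.Random(seed)  # seeded generator, so reruns give identical draws
    sample = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return {
        "params": {"seed": seed, "n_samples": n_samples},
        "python_version": platform.python_version(),
        "mean": statistics.fmean(sample),
        "stdev": statistics.stdev(sample),
    }


# Hypothetical parameter values, chosen only for illustration.
record = run_analysis(seed=2017, n_samples=100)

# Serialize the record; in practice it could be written next to the outputs.
print(json.dumps(record, indent=2, sort_keys=True))
```

Because every input is an explicit argument and the random generator is seeded, calling `run_analysis` twice with the same arguments returns the same record, which is the property point-and-click workflows lack.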

## 2. Version control
* We change stuff as we work, and changing stuff tends to break stuff
* How can we efficiently and easily keep track of changes?
    * *Not* by naming our files things like *analysis_script_v6_April_2017_final_final_TY_edits.doc.txt.bak*
* Formal version control tools (e.g., git, SVN) are ubiquitous in software development
    * Every single change to every document can be incrementally tracked
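
One way to connect version control to analysis outputs is to stamp each result with the commit that produced it. The sketch below assumes the analysis lives in a git repository and that the `git` command is installed; the helper name `current_commit` and the `results` dictionary are hypothetical, and the helper falls back to `"unknown"` when either assumption fails.

```python
import subprocess


def current_commit():
    """Return the current git commit hash, or "unknown" if it cannot be determined."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or we are not inside a repository with commits
        return "unknown"
    return out.stdout.strip()


# Placeholder result; the point is the attached code version, not the number.
results = {"n_samples": 100, "code_version": current_commit()}
print(results)
```

With the commit hash stored alongside a result, anyone can later check out exactly the code that generated it.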

## 2b. Doing it in front of other people
* If you maintain your code/data/results publicly, there are many other benefits
    * People are more likely to use/cite your work ([McKiernan et al., 2016](https://elifesciences.org/content/5/e16800))
    * People are more likely to contribute to your work
    * Fewer "I'm sorry, I don't know where those data are" emails to write

## 3. Interactive notebooks
* A lab notebook, in interactive, digital form
* Consolidates text/markdown, code, results, figures, etc.

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

# Seed the generator so the scatter plot is identical on every run
rng = np.random.default_rng(0)
plt.scatter(rng.normal(size=100), rng.normal(size=100))

* In the best case, essentially a [fully reproducible scientific paper](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks#reproducible-academic-publications)
* These slides are a [Jupyter](http://jupyter.org) notebook

## 4. Test everything
* In psychology, we frequently talk about "manipulation checks"
* The same notion is formalized and made ubiquitous in software testing
* Write tests that automatically verify that code does what you expect it to
    * Prevents changes to one part of the codebase from breaking others
* E.g., if you wrote your own statistical routine, verify that it correctly produces known results
* There are all kinds of tools to facilitate fully integrated and automated testing

In [None]:
def my_addition_function(x, y):
    # Bug: adds 2*y rather than y, so the test in the next cell fails with an AssertionError
    return x + 2*y

In [None]:
def test_addition_function():
    assert my_addition_function(4, 6) == 10

In [None]:
test_addition_function()
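
The same pattern extends to the statistical-routine case mentioned in the list above. The following is a minimal sketch, assuming a hand-written `sample_variance` function (a hypothetical name used only here): the test checks it against a value worked out by hand and against the standard library's `statistics.variance`.

```python
import math
import statistics


def sample_variance(values):
    """Hand-written unbiased sample variance (divides by n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def test_sample_variance():
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    # Known result by hand: mean is 5, squared deviations sum to 32, so 32 / 7.
    assert math.isclose(sample_variance(data), 32 / 7)
    # Cross-check against the standard library's implementation.
    assert math.isclose(sample_variance(data), statistics.variance(data))


test_sample_variance()
```

Running the cell silently means the routine agrees with both references; a test runner such as pytest could collect `test_sample_variance` automatically.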

## 5. Accept error
* Bugs are inevitable in both software and science
* Enormous cultural differences in recognition/acceptance of error
* Why?
    * The immutability of peer-reviewed journal publications imposes huge costs on admissions of error
    * Open-source software allows and encourages rapid changes and improvement
* And no, people do *not* point and laugh when someone acknowledges an error
    * They say, "thank you for fixing that"

# So why aren't we doing all this stuff?
* We are! Scientists have been making huge strides
* But that's no reason for complacency
* Many people are still not on board
* A couple of common objections (and rebuttals)...

## "I don't know how to do this stuff!"
* Most scientists don't receive formal training in programming or scientific computing
* Rebuttal:
    * Scientists are smart! We can learn!
    * There's no question that different training will be required
    * We should think of scientific computing the way we think of statistics: as a fundamental prerequisite for doing serious work

## "This all takes a lot of time and effort, and I'm really busy!"
* It's easier to write the minimal, messy script than to create clean, reproducible, shareable repositories
* Rebuttal:
    * Oh yeah? How much time do you waste trying to reconstruct/reproduce your own buggy, opaque analyses from 6 months ago?
    * May be true in the short term, but efficiency increases dramatically in the long term
    * "You can do anything in just 5 lines of R code, but learning to write those 5 lines will take you a few years" --someone on Twitter

## "They're all going to laugh at me!"
* Showing everyone your work will lead them to point out all the problems with it
* Rebuttal:
    * Let's say they do... they're probably saving you from more laughs in the long term
    * You should be so lucky
    * People are way more understanding than you might think (e.g., because they have the same doubts)

## "The incentives are broken!"
* The current system incentivizes fast-and-flashy work and disincentivizes careful, methodical, reproducible research
* Rebuttal:
    * It also arguably incentivizes outright data fabrication, but we don't do that, right?
    * Not clear that this is true any more; one can now earn recognition for doing reproducible work
    * There are huge incentives for learning scientific computing (e.g., having excellent career options outside of academia)

# Scientific computing as the new statistics
* We're failing our students if we don't teach them the scientific computing skills they need in order to conduct reproducible science
* All of the arguments one hears against making such training ubiquitous could also be applied to statistics
* Widespread training in scientific computing would alleviate many of our reproducibility/replication problems for free