## On reproducible research in single-cell genomics

#### Structure

* [What and why](#bullet1)
* [Main takeaways](#bullet2)
* [Structuring a data science project and script templates](#bullet3)
* [Reproducible results](#bullet4)
* [Documentation](#bullet5)
* [Upload/retrieve your data](#bullet6)

### Environment setup

In [None]:
# basic modules
import os, re, time, importlib
import sys, warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
from IPython.display import clear_output

In [None]:
# in-house/developing modules
# tools modules
import scanpy as sc
import matplotlib.pyplot as plt
sc.logging.print_versions()

In [None]:
# setting visualisation parameters
# one call only: every call resets the options it does not receive to their defaults
sc.settings.set_figure_params(dpi=200, frameon=False, figsize=(10, 10))

In [None]:
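# show the full working directory briefly, then clear it from the saved output
print("Working at:\n" + os.getcwd()); time.sleep(3); clear_output()
# report the environment (path trimmed to start at 'conda') and the folder name
print("Environment:", re.sub('.*conda', 'conda', os.__file__))
print("Working at:", os.path.basename(os.getcwd()))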

### What and why <a class="anchor" id="bullet1"></a>

A study is reproducible when its results can be replicated using the original data, code, and documentation{cite}`Alston_Rick_2021`. Reproducibility is not binary, however: a study falls somewhere on a spectrum between no replication and full reproducibility{cite}`Peng_2011`. Our goal is therefore to provide the tools that move you along that spectrum towards full reproducibility.

If your research is not reproducible, it cannot be validated, mistrust grows around it, and the implementation and comparison of your methods slow down. The result is a reproducibility crisis that affects not only other people’s work but also your own{cite}`Baker_2016_Ball_2018`. There are now incentives to alleviate this crisis: data and code availability has become a minimum requirement for publication in journals, and for research councils and agencies{cite}`Announcement_2016`. Furthermore, when results are questionable, reproducibility makes it easier to find the mistakes, and when findings are interesting, it makes them easier to build upon.

Everyone strives to make their projects reproducible. Consequently, there are many tools and much advice on reproducibility out there{cite}`Alston_Rick_2021`. Here, we present the tools and practices that the single-cell community uses.

### Main takeaways <a class="anchor" id="bullet2"></a>

### Structuring a data science project and script templates <a class="anchor" id="bullet3"></a>

### Reproducible results <a class="anchor" id="bullet4"></a>

Two main technical aspects determine whether results are reproducible: package versions and random seeds. Environment managers are tools that give you control over the package versions you use. Currently, the single-cell community uses two programming languages: Python and R{cite}`Zappia_Phipson_Oshlack_2018`. Each has its own set of environment managers. The most popular one in Python is conda. For R, renv was developed more recently.
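
Alongside an exported environment file (for example from `conda env export`), it can help to record the versions an analysis actually ran with. The following is a minimal sketch that uses only the Python standard library; the function name `record_versions` and the package list are hypothetical and are not part of any tool named above.

```python
import json
import platform
import sys
from importlib import metadata


def record_versions(packages):
    """Collect the Python version, the platform, and the installed version
    of each requested package (None when a package is not installed)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


# Hypothetical package list; replace it with what your analysis imports.
snapshot = record_versions(["scanpy", "anndata", "numpy"])
print(json.dumps(snapshot, indent=2))
```

Saving this output next to your results gives a lightweight record of the software state, but it does not replace a full environment specification.
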
The effects of random seeds show up when you rerun an analysis…
You might need to set a seed for several algorithms.
On GPUs, for example through [PyTorch](https://pytorch.org/docs/stable/notes/randomness.html), further requirements for reproducibility appear: the platform and the processor.
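
Because several libraries keep their own random number generators, one common pattern is to seed them all in one place at the top of a notebook. The sketch below is one way to do that, under the assumption that NumPy and PyTorch may or may not be installed; the helper name `set_global_seed` is hypothetical.

```python
import os
import random


def set_global_seed(seed=0):
    """Seed every random number generator found in this environment and
    return the names of the libraries that were seeded."""
    seeded = ["random"]
    random.seed(seed)
    # Only affects Python processes started after this point (e.g. workers).
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # seeds NumPy's legacy global generator
        seeded.append("numpy")
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        seeded.append("torch")
    except ImportError:
        pass
    return seeded


print("Seeded:", set_global_seed(0))
```

A global seed does not cover everything. Many scanpy functions, such as `sc.tl.umap` and `sc.tl.leiden`, take their own `random_state` argument, and on GPUs identical seeds can still give different results across platforms and hardware, as the PyTorch notes linked above explain.
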

### Documentation <a class="anchor" id="bullet5"></a>

### Upload/retrieve your data <a class="anchor" id="bullet6"></a>

Your data is incredibly valuable to the community. The odds of it being reachable drop significantly each year after publication{cite}`Vines_Albert_Andrew_Débarre_Bock_Franklin_Gilbert_Moore_Renaut_Rennison_2014`. Documenting it and storing it in a permanent place is therefore of the utmost importance. Some repositories for raw and processed data are GEO, EGA, and Zenodo. Other repositories allow the exploration of your single-cell data, for example cellxgene, FASTGenomics, and UCSC Cell Browser.

However, uploading the data to a repository is not enough: your data needs to be findable, accessible, interoperable, and reusable{cite}`Wilkinson_Dumontier_Aalbersberg_Appleton_Axton_Baak_Blomberg_Boiten_da Silva Santos_Bourne_et al._2016`. These are known as the FAIR Principles. Furthermore, your metadata is key not only to reproducing your results but also to understanding your data and adapting it for integration with others’ data. CEDAR provides templates for comprehensive FAIR metadata{cite}`Musen_Bean_Cheung_Dumontier_Durante_Gevaert_Gonzalez-Beltran_Khatri_Kleinstein_O’Connor_et al._2015`.