# How Would OHBM SEA-SIG Responsibly Do Reproducible Analysis

## Problem

We have one MRI dataset organized following the [Brain Imaging Data Structure (BIDS)](https://bids-specification.readthedocs.io/en/stable/) community standard; we want to run the container of one BIDS App (`mriqc` for instance here); and we’d like not only to "manage" the process and output in a reproducible way but also to track the carbon footprint of the run.

## OHBM SEA-SIG Solution

### In theory

#### Reproducible analysis of a dataset

To achieve full reproducibility of an analysis of a BIDS dataset with a container, best practices suggest that the dataset and the container should be under content management, and that the processing should be run in a way that manages the output. At first, this can sound quite complicated, especially if you are not familiar with version control systems such as `Git`, but [ReproNim: A Center for Reproducible Neuroimaging Computation](https://www.repronim.org) and many other collaborators have contributed, over the past 5 years, a great series of tools, principles, and tutorials that make this task very accessible. We adopt the following:

*   [`Datalad`](https://www.datalad.org/): open source distributed data management system built on top of git and git-annex.

    > "DataLad is a free and open source distributed data management system that keeps track of your data, creates structure, ensures reproducibility, supports collaboration, and integrates with widely used data infrastructure." - ["https://www.datalad.org/"](https://www.datalad.org/)

*   [`datalad-container`](https://pypi.org/project/datalad-container/): DataLad extension for containerized environments.

    > "This extension equips DataLad’s run/rerun functionality with the ability to transparently execute commands in containerized computational environments. On re-run, DataLad will automatically obtain any required container at the correct version prior execution." - ["http://docs.datalad.org/projects/container"](http://docs.datalad.org/projects/container)

*   `YODA` ([YODA’s organigram on data analysis](https://github.com/myyoda)) principles.
    
    > "The principles outlined in YODA set simple rules for directory names and structures, best-practices for version-controlling dataset elements and analyses, facilitate usage of tools to improve the reproducibility and accountability of data analysis projects, and make collaboration easier." - ["Datalad Handbook Chapter 6.2 - YODA: Best practices for data analyses in a dataset"][2]
    
*   [`ReproNim/containers`](https://github.com/ReproNim/containers): Datalad dataset with a collection of 40 popular computational tools provided within ready-to-use containerized environments.
    It is designed to be easily included as a subdataset within larger study (super)datasets to facilitate rapid and
    reproducible computation, while adhering to the [YODA principles][2] and retaining a clear and unambiguous
    association between data, code, and computing environments using DataLad and its `datalad-container` extension.

So practically, this means that we should adhere to the YODA principles and put the project dataset under Datalad control, install the Datalad dataset of the input BIDS dataset and `ReproNim/containers` as subdatasets, and run the container on the data using [`datalad containers-run`](http://docs.datalad.org/projects/container/en/stable/generated/man/datalad-containers-run.html#), which records what was done to the data, and with what.

Thanks to git-annex, when a Datalad dataset is installed, the copy does not contain the contents of all annexed files by default, but displays only the file names. The content of a file is retrieved only when it is needed for analysis, which makes this approach very efficient for very large datasets; we should therefore expect a positive impact on the carbon emissions incurred by this process. Adopting the ecosystem of ReproNim tools can thus provide us not only with a complete and easy way to achieve full reproducibility of an analysis of a BIDS dataset, but also with a more responsible way to handle, reuse, and share our data and analysis.
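
To make the "file names only" behaviour concrete: in git-annex's default (locked) mode, an annexed file appears in the working tree as a symbolic link, and the link dangles until the content has been retrieved. The following is a minimal sketch under that assumption (it does not cover unlocked files or adjusted branches); the helper names `annex_content_present` and `list_missing_content` are hypothetical and not part of Datalad.

```python
from pathlib import Path
from typing import List


def annex_content_present(path: Path) -> bool:
    """Return True if the file content is available locally.

    Assumes git-annex's default symlink mode, in which an annexed file
    whose content has not been retrieved is a dangling symbolic link.
    """
    # Path.exists() follows symlinks, so it is False for a dangling link.
    return path.exists()


def list_missing_content(root: Path) -> List[Path]:
    """List symlinks under `root` whose content has not been fetched."""
    return sorted(
        p for p in root.rglob("*")
        if ".git" not in p.parts and p.is_symlink() and not p.exists()
    )
```

In a Datalad dataset, content reported as missing would typically be obtained with `datalad get <path>` and released again with `datalad drop <path>`.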

##### Want to know more about Datalad and the YODA principles?

Check the [Datalad handbook](https://handbook.datalad.org) (YODA principles available in ["Chapter 6.2 - YODA: Best practices for data analyses in a dataset"][2]) as well as the tutorials ["ReproIn/DataLad: A complete portable and reproducible fMRI study from scratch"](http://www.repronim.org/sfn2018-training/04-02-reproin/) and ["How Would ReproNim Do Local Container Analysis"][1].

#### Tracking of CO2 emissions of code execution

To track the carbon footprint of a run, a few Python solutions currently exist, such as [carbontracker](https://github.com/lfwa/carbontracker), [experiment-impact-tracker](https://github.com/Breakend/experiment-impact-tracker), and [codecarbon](https://codecarbon.io/). Here we will use `codecarbon`, but all of these solutions (1) work on systems with Intel chips (that support the RAPL or powergadget interfaces) and NVIDIA GPUs, (2) require embedding only a few lines in the code, and (3) estimate the amount of carbon dioxide (CO2) produced by its execution. Switching from one tool to another should require relatively little effort.
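
The estimate these tools report follows, in essence, from multiplying the energy consumed by the carbon intensity of the electricity used. A minimal worked sketch of that arithmetic is given below; the function name and the numeric values (average power draw, run time, grid intensity) are hypothetical placeholders, not measurements and not the internals of any of the tools.

```python
def estimate_emissions_kg(avg_power_watts: float,
                          duration_hours: float,
                          intensity_kg_per_kwh: float) -> float:
    """Estimate the CO2-equivalent emissions (in kg) of a computation."""
    # Energy in kWh = power in kW multiplied by the duration in hours.
    energy_kwh = avg_power_watts / 1000.0 * duration_hours
    # Emissions = energy multiplied by the carbon intensity of the grid.
    return energy_kwh * intensity_kg_per_kwh


if __name__ == "__main__":
    # Hypothetical values: 150 W average draw for 2 hours on a grid
    # emitting 0.4 kg CO2eq per kWh.
    print(f"{estimate_emissions_kg(150, 2, 0.4):.3f} kg CO2eq")
```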

[1]: <https://how-would.repronim.org/en/latest/vol01/localcontainer.html> "How Would ReproNim Do Local Container Analysis"
[2]: <https://handbook.datalad.org/en/latest/basics/101-127-yoda.html> "Datalad Handbook Chapter 6.2 - YODA: Best practices for data analyses in a dataset"

### In practice

To show this in practice, we build on the [Typical workflow example](https://github.com/ReproNim/containers#a-typical-workflow) from the README of `ReproNim/containers`, which shows how to adhere to the YODA principles and use `datalad`, `datalad-container`, and `ReproNim/containers` to run [`mriqc 0.16.0`](https://mriqc.readthedocs.io) on a [demo BIDS dataset](https://github.com/ReproNim/ds000003-demo) in a fully reproducible manner from the terminal. We translate it into Python using `datalad.api` (the Python API of Datalad) and extend it with the following `codecarbon` code, which estimates the carbon footprint incurred by the execution of `mriqc`:
```python
# Create and start the carbon footprint tracker
tracker = EmissionsTracker()
tracker.start()

[...]

# Stop the carbon footprint tracker and get the estimated CO2 emissions
emissions: float = tracker.stop()
```
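
One caveat of the start/stop pattern is that an exception raised in the tracked code would skip `tracker.stop()`. A possible way around this, sketched below, is a small context manager that always stops the tracker. The helper `tracked` and the stand-in `FakeTracker` are hypothetical names introduced for illustration; the sketch only assumes that the tracker object exposes `start()` and `stop()`, as `EmissionsTracker` does in the snippet above.

```python
from contextlib import contextmanager


@contextmanager
def tracked(tracker, results: dict):
    """Start `tracker`, run the enclosed block, and always stop the tracker.

    The value returned by `tracker.stop()` is stored in results["emissions"].
    """
    tracker.start()
    try:
        yield tracker
    finally:
        # Runs even if the enclosed block raises an exception.
        results["emissions"] = tracker.stop()


class FakeTracker:
    """Stand-in with the same start()/stop() interface, for demonstration only."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self) -> float:
        self.running = False
        return 0.0  # a real tracker would return the estimated emissions


results = {}
with tracked(FakeTracker(), results):
    pass  # the code to be tracked goes here
print(results)
```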

#### 1. Import the different packages

In [None]:
# Utilities
import os
import logging
from pathlib import Path

# Make datalad.api happy in the notebook
import nest_asyncio
nest_asyncio.apply()

# Import Datalad Python API
import datalad.api as dl

# Import the carbon tracker of codecarbon
from codecarbon import EmissionsTracker

#### 2. Create the project dataset that will contain `mriqc` output 

The code below creates a new Datalad dataset in the `ds000003-qc` directory and tells Datalad (via the `text2git` configuration) to store text files in Git rather than in git-annex:

In [None]:
projectds = 'ds000003-qc'
projectds_dir = Path.cwd() / projectds
ds = dl.create(path=projectds_dir, cfg_proc='text2git')

Change the current working directory to the project dataset directory:

In [None]:
os.chdir(ds.path)

#### 3. Install the [ReproNim/containers dataset](https://github.com/ReproNim/containers)

The code below installs `ReproNim/containers` as a subdataset under the `containers` directory:

In [None]:
ds_containers = dl.install(
    dataset=ds,
    source='///repronim/containers',
    path='containers'
)

`ReproNim/containers` currently references a collection of 40 popular computational tools provided within ready-to-use containerized environments, which can be listed with `ds.containers_list(recursive=True)`:

In [None]:
ds.containers_list(recursive=True)

#### 4. Freeze `mriqc` version to ``0.16.0``

`ReproNim/containers` provides a collection of utility scripts in its `scripts` directory. The code below runs `freeze_versions` with `datalad run` to ensure that `mriqc 0.16.0` will be used:

In [None]:
res = ds.run(
    message="Downgrade/Freeze mriqc container version",
    cmd='containers/scripts/freeze_versions "bids-mriqc=0.16.0"'
)

All modifications made by the execution of the script are thus recorded, together with a brief and comprehensible `message`.
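
The recording takes the form of a Git commit whose message carries a JSON block, delimited by two marker lines, that lists the command and its inputs and outputs. The sketch below extracts that block from a commit message. The helper name `extract_run_record` is hypothetical, the marker strings are assumed to match those written by current Datalad versions, and the sample message is illustrative rather than the actual output of the step above.

```python
import json

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message: str) -> dict:
    """Return the JSON run record embedded in a `datalad run` commit message."""
    start = commit_message.index(START) + len(START)
    end = commit_message.index(END, start)
    return json.loads(commit_message[start:end])


# Illustrative commit message (not real output).
sample = """[DATALAD RUNCMD] Downgrade/Freeze mriqc container version

=== Do not change lines below ===
{
 "cmd": "containers/scripts/freeze_versions \\"bids-mriqc=0.16.0\\"",
 "exit": 0,
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

record = extract_run_record(sample)
print(record["cmd"])
```

On a real dataset, the message of the latest commit could be obtained with `git log -1 --format=%B` and passed to the same function.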

#### 5. Install input dataset

The code below installs, as a subdataset under the `sourcedata` directory, the dataset redistributed by ReproNim for the [Typical workflow example](https://github.com/ReproNim/containers#a-typical-workflow) of `ReproNim/containers`: a sample of the [OpenNeuro ds000003 dataset](https://openneuro.org/datasets/ds000003) with two participants.

In [None]:
ds_sourcedata = dl.install(
    dataset=ds,
    source='https://github.com/ReproNim/ds000003-demo',
    path='sourcedata'
)

#### 6. Run `mriqc` with `datalad containers-run` with carbon footprint tracking

To track the carbon footprint during the execution of `mriqc` with `datalad containers-run`, we just need to surround the invocation of `containers_run` (the Datalad Python API equivalent of `datalad containers-run`) with a few lines of `codecarbon` code as follows:

In [None]:
# Create a code/ directory in the dataset
# where codecarbon will save CO2 emissions
# in a CSV file called "emissions.csv"
code_dir = str(Path(ds.path) / "code")
os.makedirs(code_dir, exist_ok=True)

# Create and start the carbon footprint tracker
tracker = EmissionsTracker(
    project_name="mriqc-0.16.0",
    output_dir=code_dir,
    measure_power_secs=15,
)
tracker.start()

# Run `mriqc` with `datalad containers-run`
mriqc_run_results = ds.containers_run(
    cmd='{inputs} {outputs} participant group -w workdir',
    inputs=["sourcedata"],
    outputs=["derivatives"],
    container_name="containers/bids-mriqc"
)

# Stop the carbon footprint tracker and get the estimated CO2 emissions
emissions: float = tracker.stop()
print(f"CARBON FOOTPRINT:\n\t* Estimated CO2 emissions = {emissions} kg")

Here, we tell `datalad containers-run` to run the container `"containers/bids-mriqc"` with the command `cmd='{inputs} {outputs} participant group -w workdir'`. `{inputs}` and `{outputs}` are placeholders that are filled with the values of `inputs` (here `"sourcedata"`) and `outputs` (here `"derivatives"`).
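
As a rough illustration of the substitution (Datalad's own expansion also handles globbing and quoting, which this sketch ignores), Python's `str.format` applied to the same command template gives the argument string passed to the container:

```python
cmd_template = "{inputs} {outputs} participant group -w workdir"

# Simplified stand-in for Datalad's placeholder expansion:
# multiple values are joined with spaces.
inputs = ["sourcedata"]
outputs = ["derivatives"]

expanded = cmd_template.format(
    inputs=" ".join(inputs),
    outputs=" ".join(outputs),
)
print(expanded)
```

The result follows the usual BIDS App calling convention: input directory, output directory, then the analysis level(s).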

## Conclusion

You now know how to analyze the demo BIDS dataset with a container from the `ReproNim/containers` collection.
From this basic knowledge, you should be able, with little effort, to customize this approach
to use any of the 40 tools provided by `ReproNim/containers` (such as `fmriprep`, `qsiprep`, ...) on the collection of datasets already available to the community on OpenNeuro, or on your own BIDS dataset under Datalad control, while tracking your carbon emissions.
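
To follow the emissions over repeated runs, the `emissions.csv` file written to `code/` can be summarized with the standard library alone. This is a minimal sketch that assumes the file has `project_name` and `emissions` columns (the latter in kg), which may differ between `codecarbon` versions; the function name `total_emissions_by_project` is hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path
from typing import Dict


def total_emissions_by_project(csv_path: Path) -> Dict[str, float]:
    """Sum the `emissions` column per `project_name` (assumed column names)."""
    totals: Dict[str, float] = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["project_name"]] += float(row["emissions"])
    return dict(totals)


if __name__ == "__main__":
    path = Path("code") / "emissions.csv"
    if path.exists():
        for project, kg in total_emissions_by_project(path).items():
            print(f"{project}: {kg:.6f} kg")
```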