# How Would OHBM SEA-SIG Responsibly Do Reproducible Analysis

## Problem Statement

Here’s a simple "real world" problem. I have an MRI dataset organized following the Brain Imaging Data Structure (BIDS) community standard; I have a container from the collection of BIDS Apps that I want to run (here, `mriqc`); and I’d like not only to "manage" the process and its output in a reproducible way but also to track the carbon footprint of the run.

## OHBM SEA-SIG Solution

### In theory

To achieve reproducibility, the dataset and the container should be under content management, and you should run your processing in a way that manages the output, following ["Volume 1: How Would ReproNim Do Local Container Analysis"](https://how-would.repronim.org/en/latest/vol01/localcontainer.html). Practically, this means that we should put the imaging data and the container under DataLad control and run the container on the data using `datalad containers-run`.

To track the carbon footprint of your run, a few Python solutions currently exist, such as [carbontracker](https://github.com/lfwa/carbontracker), [experiment-impact-tracker](https://github.com/Breakend/experiment-impact-tracker), and [codecarbon](https://codecarbon.io/). Here we will use `codecarbon`. As you will see, it is integrated by adding a few lines to Python code, and it estimates the amount of carbon dioxide (CO2) produced by the execution of that code.
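
To give an intuition for what such an estimate involves, here is a minimal sketch of the underlying arithmetic: emissions are approximated as the energy consumed multiplied by the carbon intensity of the electricity used. The function name `estimate_emissions_kg` and the numeric values are hypothetical and chosen for illustration only; this is not `codecarbon`'s actual implementation.

```python
def estimate_emissions_kg(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """Return estimated emissions in kg CO2-eq.

    energy_kwh: energy consumed by the computation, in kWh.
    intensity_g_per_kwh: carbon intensity of the electricity, in g CO2-eq per kWh.
    """
    if energy_kwh < 0 or intensity_g_per_kwh < 0:
        raise ValueError("energy and carbon intensity must be non-negative")
    # Convert grams to kilograms
    return energy_kwh * intensity_g_per_kwh / 1000.0


# Hypothetical values for illustration only (not measured data)
example_kg = estimate_emissions_kg(energy_kwh=2.0, intensity_g_per_kwh=500.0)
print(f"{example_kg:.3f} kg CO2-eq")
```

In practice, the tracker estimates the energy term from the power draw of the hardware over the duration of the run, and the intensity term depends on where the computation takes place.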

### In practice

#### 1. Import the different packages

In [None]:
# Utilities
import os
import logging
from pathlib import Path

# Make datalad.api happy in the notebook
import nest_asyncio
nest_asyncio.apply()

# Import Datalad Python API
import datalad.api as dl

# Import the carbon tracker of codecarbon
from codecarbon import EmissionsTracker
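
Since the point of the exercise is reproducibility, it can help to record which versions of these packages are installed. The following is a minimal sketch using only the standard library; the helper name `report_versions` is hypothetical, and the distribution names listed are assumed to match how the packages were installed in your environment.

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names assumed for this notebook's dependencies
versions = report_versions(
    ["datalad", "datalad-container", "codecarbon", "nest-asyncio"]
)
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```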

#### 2. Create a dataset to contain mriqc output 

In [None]:
projectds = 'ds000003-qc'
projectds_dir = (Path(os.getcwd()) / projectds)
ds = dl.create(path=projectds_dir, cfg_proc='text2git')
# Change the current working directory to the project dataset directory
os.chdir(ds.path)

#### 3. Install the [ReproNim/containers collection](https://github.com/ReproNim/containers)

[ReproNim/containers](https://github.com/ReproNim/containers) provides a DataLad dataset (git/git-annex repository) that references a collection of popular computational tools provided within ready-to-use containerized environments.
The code below installs `ReproNim/containers` as a subdataset under the `containers` directory:

In [None]:
ds_containers = dl.install(
    dataset=ds,
    source='///repronim/containers',
    path='containers'
)

`ds.containers_list(recursive=True)` allows you to show the list of tools available:

In [None]:
ds.containers_list(recursive=True)
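
The full list can be long. As a sketch, the helper below filters the returned records by keyword. It assumes that each record is a dictionary with a `name` key (for example `containers/bids-mriqc`), which should be checked against the output of your installed `datalad-container` version; the helper name `find_containers` and the example records are hypothetical.

```python
def find_containers(results, keyword):
    """Return the sorted names of container records whose name contains keyword."""
    return sorted(r["name"] for r in results if keyword in r.get("name", ""))


# Hypothetical records mimicking the assumed structure of the results
example = [
    {"name": "containers/bids-mriqc"},
    {"name": "containers/bids-fmriprep"},
]
print(find_containers(example, "mriqc"))
```

With the dataset above, this would be called as `find_containers(ds.containers_list(recursive=True), "mriqc")`.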

#### 4. Freeze the `mriqc` version at `0.16.0`

In [None]:
res = ds.run(
    message="Downgrade/Freeze mriqc container version",
    cmd='containers/scripts/freeze_versions "bids-mriqc=0.16.0"'
)
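
The `res` variable holds the result records returned by DataLad. A minimal sketch for summarizing them is shown below, assuming each record is a dictionary carrying a `status` key with values such as `ok`, `notneeded`, `impossible`, or `error`. The helper names `summarize_status` and `has_failures` and the example records are hypothetical.

```python
from collections import Counter


def summarize_status(results):
    """Count result records by their 'status' field."""
    return Counter(r.get("status", "unknown") for r in results)


def has_failures(results):
    """Return True if any record reports an 'error' or 'impossible' status."""
    return any(r.get("status") in ("error", "impossible") for r in results)


# Hypothetical records for illustration only
example_results = [{"status": "ok"}, {"status": "ok"}, {"status": "notneeded"}]
print(dict(summarize_status(example_results)))
print(has_failures(example_results))
```

The same helpers could be applied to the output of any other DataLad call in this notebook, for example `has_failures(res)`.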

#### 5. Install input dataset

We use a demo dataset, originating from the OpenNeuro ds000003 dataset, that ReproNim redistributes for testing `ReproNim/containers`.

In [None]:
ds_sourcedata = dl.install(
    dataset=ds,
    source='https://github.com/ReproNim/ds000003-demo',
    path='sourcedata'
)
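
Before launching a long run, a quick sanity check of the input can be useful. The sketch below lists the subject directories of a BIDS dataset, relying on the BIDS convention that subject directories are named `sub-<label>`. The helper name `list_bids_subjects` is hypothetical.

```python
from pathlib import Path


def list_bids_subjects(bids_root):
    """Return the sorted names of 'sub-*' directories directly under bids_root."""
    root = Path(bids_root)
    return sorted(p.name for p in root.glob("sub-*") if p.is_dir())
```

In this notebook, it could be called as `list_bids_subjects(Path(ds.path) / "sourcedata")`.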

#### 6. Run `mriqc` with `datalad containers-run` with carbon footprint tracking

To track the carbon footprint during the execution of `mriqc` with `datalad containers-run`, you just need to surround the invocation of `containers_run` (the DataLad Python API equivalent of `datalad containers-run`) with a few lines of `codecarbon` code as follows:

In [None]:
# Create a code/ directory in the dataset
# where codecarbon will save CO2 emissions
# in a CSV file called "emissions.csv"
code_dir = str(Path(ds.path) / "code")
os.makedirs(code_dir, exist_ok=True)

# Create and start the carbon footprint tracker
tracker = EmissionsTracker(
    project_name="mriqc-0.16.0",
    output_dir=code_dir,
    measure_power_secs=15,
)
tracker.start()

# Run `mriqc` with `containers_run` (the Python equivalent of `datalad containers-run`)
mriqc_run_results = ds.containers_run(
    cmd='{inputs} {outputs} participant group -w workdir',
    inputs=["sourcedata"],
    outputs=["derivatives"],
    container_name="containers/bids-mriqc"
)

# Stop the carbon footprint tracker and get the estimated CO2 emissions (in kg)
emissions: float = tracker.stop()
print(f"CARBON FOOTPRINT:\n\t* Estimated CO2 emissions = {emissions} kg")
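
The tracker also writes its estimates to `emissions.csv` in the `code/` directory. The following sketch sums the recorded emissions per project using only the standard library. It assumes that the CSV contains `project_name` and `emissions` columns, with emissions expressed in kg; the column names may differ between `codecarbon` versions and should be checked against your file. The helper name `total_emissions_by_project` is hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path


def total_emissions_by_project(csv_path):
    """Sum the 'emissions' column per 'project_name' in a codecarbon-style CSV."""
    totals = defaultdict(float)
    with Path(csv_path).open(newline="") as handle:
        for row in csv.DictReader(handle):
            # Column names are assumptions; adjust them to match your file
            totals[row["project_name"]] += float(row["emissions"])
    return dict(totals)
```

In this notebook, it could be called as `total_emissions_by_project(Path(code_dir) / "emissions.csv")`, which is convenient when the same dataset accumulates several tracked runs.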