# DataLad Demo

## What is DataLad?

- “A free and open-source data management system for everyone.”
- Helps obtain, track modifications of, and (re)share research data and code.
- Built on:
  - Git, a version control system often used for managing code.
  - git-annex, which helps git manage large files.
- Manages directories with files, and keeps track of the relationships between those files.
- Traditionally a command-line tool, but a GUI (DataLad Gooey) was recently released: https://docs.datalad.org/projects/gooey/en/latest/index.html

## Datasets

- A DataLad dataset is a directory managed by DataLad.
- All files contained in a dataset are tracked by DataLad.
- Can be nested: A dataset can contain one or more subdatasets.

## Creating Datasets

- To create an empty dataset:
  - `datalad create {path}`
- Let's create an example dataset:

In [None]:
# Git will complain if you haven't set up an identity:
!git config --global user.name "Tristan"
!git config --global user.email "example@example.com"

In [None]:
# -c text2git means that text files won't be annexed. More on this later...
!datalad create -c text2git datalad-tutorial

## Modifying datasets

- For DataLad to keep track of your files, you need to create and save them in a dataset.
- Let’s add a README file and save it to our dataset.

In [None]:
!echo "This is an example DataLad dataset." > datalad-tutorial/README.md

- `datalad status` will show any unsaved files in the dataset.

In [None]:
!datalad -C datalad-tutorial status

- `datalad save` will save any new files or modifications in a dataset.
- Adding `-m {some message}` will save a message to the commit log.
  - This is a good idea, because the default messages are generally pretty unhelpful.

In [None]:
!datalad -C datalad-tutorial save -m "Add README" .

- To see the commit log, run `git log`

In [None]:
!git -C datalad-tutorial log

## Obtaining datasets

- `datalad clone {dataset-url}` will install an existing dataset to your filesystem
- A few sources of Open Data available with DataLad:
  - DataLad Repository: http://datasets.datalad.org
  - CONP Portal: https://portal.conp.ca
  - OpenNeuro: https://openneuro.org
- For the purposes of this tutorial, let’s download this MRI dataset from OpenNeuro: https://openneuro.org/datasets/ds000105
- `datalad clone -d . https://github.com/OpenNeuroDatasets/ds000105.git`
  - The `-d .` part instructs DataLad to install the dataset as a subdataset of the current one.

In [None]:
!datalad -C datalad-tutorial clone -d . https://github.com/OpenNeuroDatasets/ds000105.git

In [None]:
!ls datalad-tutorial/ds000105/sub-1/anat
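
Under the hood, a subdataset is registered as a Git submodule, so the parent dataset's `.gitmodules` file lists it. As a rough sketch (the function name `registered_subdatasets` is made up for illustration, and it only reads the `path` entries rather than everything DataLad records), we can list the registered subdatasets from Python:

```python
import re
from pathlib import Path


def registered_subdatasets(dataset_root):
    """Return {submodule name: relative path} as recorded in .gitmodules.

    Hypothetical helper for illustration; it is not part of DataLad.
    """
    gitmodules = Path(dataset_root) / ".gitmodules"
    if not gitmodules.exists():
        return {}  # nothing registered (or not a dataset at all)

    subdatasets = {}
    current = None
    for line in gitmodules.read_text().splitlines():
        # Section headers look like: [submodule "ds000105"]
        header = re.match(r'\s*\[submodule "(.+)"\]\s*$', line)
        if header:
            current = header.group(1)
            continue
        # Within a section we only care about the "path = ..." entry.
        entry = re.match(r"\s*path\s*=\s*(.+?)\s*$", line)
        if entry and current is not None:
            subdatasets[current] = entry.group(1)
    return subdatasets


if __name__ == "__main__":
    print(registered_subdatasets("datalad-tutorial"))
```

In practice, `datalad subdatasets` reports the same information and is the better tool; the sketch is only meant to show that the relationship is stored in a plain text file.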

## Annexed files

- DataLad keeps track of larger files with git-annex.
- After cloning, the files' names are visible on the file system, but their content is not actually present at first.
- In DataLad terminology, these files are “annexed.”
- `datalad get {file}` will download the annexed file.
- `git annex whereis {file}` shows where an annexed file is actually stored.
- Let’s download one of the files in the dataset we just cloned:

In [None]:
!cd datalad-tutorial/ds000105 && git-annex whereis sub-1/anat/sub-1_T1w.nii.gz

In [None]:
!datalad -C datalad-tutorial get ds000105/sub-1/anat/sub-1_T1w.nii.gz
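
To see the "visible but not present" state for ourselves, here is a minimal sketch. It assumes git-annex's default layout, where an annexed file is a symlink pointing into `.git/annex/objects`, and it won't apply to adjusted or unlocked branches (e.g. on Windows). The function name `annex_content_state` is made up for illustration.

```python
import os


def annex_content_state(path):
    """Describe whether an annexed file's content is available locally.

    Assumes the default symlink-based git-annex layout.
    """
    if not os.path.islink(path):
        return "not a symlink"
    # Normalise separators so the check below works on any platform.
    target = os.readlink(path).replace(os.sep, "/")
    if "/annex/objects/" not in target:
        return "symlink, but not into the annex"
    # os.path.exists follows symlinks, so it is False for a dangling link.
    if os.path.exists(path):
        return "content present"
    return "content not yet retrieved"


if __name__ == "__main__":
    print(annex_content_state(
        "datalad-tutorial/ds000105/sub-1/anat/sub-1_T1w.nii.gz"))
```

Running this before and after the `datalad get` cell above should show the file going from a dangling symlink to one whose target exists.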

## Modifying files

- Say we want to run an analysis on the dataset we downloaded and store the results as a new dataset.

### Setup

- We’ll start by making a subdataset to store our analysis code and one to store our outputs:

In [None]:
!datalad -C datalad-tutorial create -c text2git -d . code
!datalad -C datalad-tutorial create -d . outputs

- Then we’ll save an analysis script in the “code” subdataset
- Note: `datalad save . -r` recursively saves all changes to subdatasets.

In [None]:
!cp analyzeimage.py datalad-tutorial/code && datalad -C datalad-tutorial save . -r
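
The contents of `analyzeimage.py` aren't shown in this tutorial. As a hypothetical stand-in, assuming all the script does is print the shape of the image it is given, something like the following would work for NIfTI-1 files using only the standard library. A real script would more likely use a library such as nibabel, and the function name `nifti1_shape` is my own.

```python
import gzip
import struct
import sys


def nifti1_shape(path):
    """Return the image dimensions stored in a NIfTI-1 header as a tuple."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        header = f.read(348)  # a NIfTI-1 header is 348 bytes long
    if len(header) < 348:
        raise ValueError(f"{path} is too short to be a NIfTI-1 file")

    # The first field (sizeof_hdr, int32) must equal 348; we use it to
    # work out whether the file is little- or big-endian.
    for byte_order in ("<", ">"):
        if struct.unpack(byte_order + "i", header[:4])[0] == 348:
            break
    else:
        raise ValueError(f"{path} does not look like a NIfTI-1 file")

    # dim is eight int16 values at byte offset 40.
    # dim[0] is the number of dimensions; dim[1:] holds their sizes.
    dim = struct.unpack(byte_order + "8h", header[40:56])
    return tuple(dim[1:dim[0] + 1])


if __name__ == "__main__" and len(sys.argv) == 2:
    print(nifti1_shape(sys.argv[1]))
```

Called as `python3 code/analyzeimage.py some_image.nii.gz`, this prints the shape to standard output, which is what lets us redirect it into a text file later on.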

### DataLad run

- Now we can run our analysis script, save the outputs, and record the command we used to generate those outputs.
- This is all possible in one command with `datalad run`

In [None]:
!datalad -C datalad-tutorial run -m "Find sub-1 T1w image shape" \
--input "ds000105/sub-1/anat/sub-1_T1w.nii.gz" \
--output "outputs/sub-1_shape.txt" \
"python3 code/analyzeimage.py {inputs} > {outputs}"

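The "record the command" part lives in the commit message: as far as I understand the current format, `datalad run` writes a JSON record between two marker lines. Here is a minimal sketch of pulling that record back out. The marker strings are my understanding of what DataLad writes rather than a documented interface, and `parse_run_record` is a made-up name.

```python
import json
import re

# Marker lines that (as assumed here) surround the JSON run record.
RUN_RECORD = re.compile(
    r"^=== Do not change lines below ===\n"
    r"(.*)\n"
    r"\^\^\^ Do not change lines above \^\^\^$",
    re.MULTILINE | re.DOTALL,
)


def parse_run_record(commit_message):
    """Return the run record as a dict, or None for an ordinary commit."""
    match = RUN_RECORD.search(commit_message)
    if match is None:
        return None
    return json.loads(match.group(1))
```

You could feed it the output of `git -C datalad-tutorial log -1 --format=%B`. For everyday use, `datalad rerun` reads the same record and re-executes the command for you.
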
## Sharing datasets

- Two cases: Shared infrastructure (e.g. a lab server), and third-party infrastructure (e.g. OSF).
- Shared infrastructure is relatively easy: anyone with access can `datalad clone` the dataset directly from its path.

In [None]:
!datalad clone datalad-tutorial datalad-tutorial-clone

- Third-party infrastructure is possible but harder, so I won’t get into it for this tutorial.
- You can publish your dataset to GitHub, but your annexed files will not be accessible without further action:
- `datalad create-sibling-github -d . -r {repo-name}`
  - Note: This has some issues on the binder, but I’ll demo the results.
  - https://github.com/tkkuehn/demo-datalad-brainhack