# DataLad Demo

## What is DataLad?

- “A free and open-source data management system for everyone.”
- Helps obtain, track modifications of, and (re)share research data and code.
- Built on:
  - Git, a version control system often used for managing code.
  - git-annex, which helps git manage large files.
- Manages directories with files, and keeps track of the relationships between those files.
- Traditionally a command-line tool, but a GUI (DataLad Gooey) was recently released.
https://docs.datalad.org/projects/gooey/en/latest/index.html 

## Datasets

- A DataLad dataset is a directory managed by DataLad
- All files contained in a dataset are tracked by DataLad
- Can be nested: A dataset can contain one or more subdatasets.

## Creating Datasets

- To create an empty dataset:
  - `datalad create {path}`
- Let's create an example dataset:

In [1]:
# Git will complain if you haven't set up an identity:
!git config --global user.name "Tristan"
!git config --global user.email "example@example.com"

In [2]:
# -c text2git means that text files won't be annexed. More on this later...
!datalad create -c text2git datalad-tutorial

[INFO   ] Running procedure cfg_text2git 
[INFO   ] == Command start (output follows) ===== 
Total: 0.00 datasets [00:00, ? datasets/s]
Total:   0%|                                     | 0.00/115 [00:00<?, ? Bytes/s][A
[0m                                                                            [A[INFO   ] == Command exit (modification check follows) ===== 
[1;1mrun[0m([1;32mok[0m): /home/jovyan/datalad-tutorial ([1;35mdataset[0m) [/srv/conda/envs/notebook/bin/python /srv...]
[1;1mcreate[0m([1;32mok[0m): /home/jovyan/datalad-tutorial ([1;35mdataset[0m)
action summary:
  create (ok: 1)
  run (ok: 1)
[0m

## Modifying datasets

- For DataLad to keep track of your files, you need to create and save them in a dataset.
- Let’s add a README file and save it to our dataset.

In [3]:
!echo "This is an example DataLad dataset." > datalad-tutorial/README.md

- `datalad status` will show any unsaved files in the dataset.

In [4]:
!datalad -C datalad-tutorial status

[1;31muntracked[0m: README.md ([1;35mfile[0m)
[0m

- `datalad save` will save any new files or modifications in a dataset.
- Adding `-m {some message}` will save a message to the commit log.
  - This is a good idea, the default messages are generally pretty unhelpful.

In [5]:
!datalad -C datalad-tutorial -m “Add README” save

Total: 0.00 datasets [00:00, ? datasets/s]
Total:   0%|                                    | 0.00/36.0 [00:00<?, ? Bytes/s][A
[1;1madd[0m([1;32mok[0m): README.md ([1;35mfile[0m)                       [A
[1;1msave[0m([1;32mok[0m): . ([1;35mdataset[0m)                           
action summary:                                                                 
  add (ok: 1)
  save (ok: 1)
[0m

- To see the commit log, run `git log`

In [6]:
!git -C datalad-tutorial log

[33mcommit b49b702f4de4a4e740dbfa2b0d6d927e8b914225[m[33m ([m[1;36mHEAD -> [m[1;32mmaster[m[33m)[m
Author: Tristan <example@example.com>
Date:   Wed Nov 1 20:02:19 2023 +0000

    [DATALAD] Recorded changes

[33mcommit b73c4ca8aadbf60cae8ea4df0506bc11ee601dd1[m
Author: Tristan <example@example.com>
Date:   Wed Nov 1 19:59:16 2023 +0000

    Instruct annex to add text files to Git

[33mcommit d79c92185160680d20598f7e006c107efdb4fe9f[m
Author: Tristan <example@example.com>
Date:   Wed Nov 1 19:59:15 2023 +0000

    [DATALAD] new dataset


## Obtaining datasets

- `datalad clone {dataset-url}` will install an existing dataset to your filesystem
- A few sources of Open Data available with DataLad:
  - DataLad Repository: http://datasets.datalad.org
  - CONP Portal: https://portal.conp.ca
  - OpenNeuro: https://openneuro.org
- For the purposes of this tutorial, let’s download this MRI dataset from OpenNeuro: https://openneuro.org/datasets/ds000105
- `datalad clone -d . https://github.com/OpenNeuroDatasets/ds000105.git`
  - The `-d .` part instructs DataLad to install the dataset as a subdataset of the current one.

In [7]:
!datalad -C datalad-tutorial clone -d . https://github.com/OpenNeuroDatasets/ds000105.git

Cloning:   0%|                             | 0.00/2.00 [00:00<?, ? candidates/s]
Enumerating: 0.00 Objects [00:00, ? Objects/s][A
                                              [A
Counting:   0%|                                | 0.00/146 [00:00<?, ? Objects/s][A
                                                                                [A
Compressing:   0%|                            | 0.00/87.0 [00:00<?, ? Objects/s][A
                                                                                [A
Receiving:   0%|                             | 0.00/1.52k [00:00<?, ? Objects/s][A
                                                                                [A
Resolving:   0%|                                | 0.00/379 [00:00<?, ? Deltas/s][A
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore           [A
[INFO   ] https://github.com/OpenNeuroDatasets/ds000105.git/config download failed: Not Found 
[INFO   ] access to 1 dataset sibling s3-PRIVATE not

In [10]:
!ls datalad-tutorial/ds000105/sub-1/anat

sub-1_T1w.nii.gz


## Annexed files

- DataLad keeps track of larger files with git-annex.
- Cloned files will be visible on the file system but not actually present at first.
- In DataLad terminology, these files are “annexed.”
- `datalad get {file}` will download the annexed file.
- `git annex whereis {file}` shows where an annexed file is actually stored.
- Let’s download one of the files in the dataset we just cloned

In [15]:
!cd datalad-tutorial/ds000105 && git-annex whereis sub-1/anat/sub-1_T1w.nii.gz

whereis sub-1/anat/sub-1_T1w.nii.gz (2 copies) 
  	1076c507-8000-4829-b9bd-969d1a766598 -- root@1f69c4ed80cf:/datalad/ds000105
   	9db26647-95d0-4561-8af5-1fff2da7f31f -- [s3-PUBLIC]

  s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds000105/sub-1/anat/sub-1_T1w.nii.gz?versionId=V939ZM0yObjUczY_D2Lu8a1Mgt2T8uQ8
ok


In [16]:
!datalad -C datalad-tutorial get ds000105/sub-1/anat/sub-1_T1w.nii.gz

Total:   0%|                                   | 0.00/6.81M [00:00<?, ? Bytes/s]
Get sub-1/an .. 1_T1w.nii.gz:   0%|            | 0.00/6.81M [00:00<?, ? Bytes/s][A
Get sub-1/an .. 1_T1w.nii.gz:   1%|    | 68.2k/6.81M [00:00<00:16, 398k Bytes/s][A
Get sub-1/an .. 1_T1w.nii.gz:   3%|▏    | 207k/6.81M [00:00<00:09, 677k Bytes/s][A
Get sub-1/an .. 1_T1w.nii.gz:   7%|▎   | 451k/6.81M [00:00<00:05, 1.26M Bytes/s][A
Get sub-1/an .. 1_T1w.nii.gz:  34%|█  | 2.35M/6.81M [00:00<00:00, 5.29M Bytes/s][A
Get sub-1/an .. 1_T1w.nii.gz:  49%|█▍ | 3.37M/6.81M [00:00<00:00, 6.55M Bytes/s][A
Get sub-1/an .. 1_T1w.nii.gz:  82%|██▍| 5.62M/6.81M [00:00<00:00, 8.27M Bytes/s][A
[1;1mget[0m([1;32mok[0m): ds000105/sub-1/anat/sub-1_T1w.nii.gz ([1;35mfile[0m) [from s3-PUBLIC...]
action summary:
  get (notneeded: 1, ok: 1)
[0m

## Modifying files

- Say we want to run an analysis on the dataset we downloaded and store the results as a new dataset.

### Setup

- We’ll start by making a subdataset to store our analysis code and one to store our outputs:

In [17]:
!datalad -C datalad-tutorial create -c text2git -d . code
!datalad -C datalad-tutorial create -d . outputs

[INFO   ] Running procedure cfg_text2git 
[INFO   ] == Command start (output follows) ===== 
Total: 0.00 datasets [00:00, ? datasets/s]
Total:   0%|                                     | 0.00/115 [00:00<?, ? Bytes/s][A
[0m                                                                            [A[INFO   ] == Command exit (modification check follows) ===== 
[1;1mrun[0m([1;32mok[0m): /home/jovyan/datalad-tutorial/code ([1;35mdataset[0m) [/srv/conda/envs/notebook/bin/python /srv...]
[1;1madd[0m([1;32mok[0m): code ([1;35mdataset[0m)                         
Total:   0%|                                 | 0.00/1.00 [00:00<?, ? datasets/s]
                                                                                [A
[A[1;1madd[0m([1;32mok[0m): .gitmodules ([1;35mfile[0m)                  
Total:   0%|                                 | 0.00/1.00 [00:00<?, ? datasets/s]
Total:   0%|                                     | 0.00/310 [00:00<?, ? Bytes/s][A
[1;1msave

- Then we’ll save an analysis script in the “code” subdataset
- Note: `datalad save . -r` recursively saves all changes to subdatasets.

In [18]:
!cp analyzeimage.py datalad-tutorial/code && datalad -C datalad-tutorial save . -r

Total: 0.00 datasets [00:00, ? datasets/s]
Total:   0%|                                     | 0.00/214 [00:00<?, ? Bytes/s][A
[1;1madd[0m([1;32mok[0m): analyzeimage.py ([1;35mfile[0m)                 [A
[1;1msave[0m([1;32mok[0m): code ([1;35mdataset[0m)                        
Total (2 skipped):  75%|███████████▎   | 3.00/4.00 [00:00<00:00, 141 datasets/s]
                                                                                [A
[A[1;1madd[0m([1;32mok[0m): code ([1;35mdataset[0m)                      
Total (2 skipped):  75%|██████████▌   | 3.00/4.00 [00:00<00:00, 11.8 datasets/s]
                                                                                [A
[A[1;1madd[0m([1;32mok[0m): .gitmodules ([1;35mfile[0m)                  
Total (2 skipped):  75%|██████████▌   | 3.00/4.00 [00:00<00:00, 11.8 datasets/s]
Total:   0%|                                     | 0.00/416 [00:00<?, ? Bytes/s][A
[1;1msave[0m([1;32mok[0m): . ([1;35mdataset[0

### DataLad run

- Now we can run our analysis script, save the outputs, and record the command we used to generate those outputs.
- This is all possible in one command with `datalad run`

In [23]:
!datalad -C datalad-tutorial run -m "Find sub-1 T1w image shape" \
--input "ds000105/sub-1/anat/sub-1_T1w.nii.gz" \
--output "outputs/sub-1_shape.txt" \
"python3 code/analyzeimage.py {inputs} > {outputs}"

[INFO   ] Making sure inputs are available (this may take some time) 
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
[1;1mrun[0m([1;32mok[0m): /home/jovyan/datalad-tutorial ([1;35mdataset[0m) [python3 code/analyzeimage.py ds000105/su...]
Total: 0.00 datasets [00:00, ? datasets/s]
Total:   0%|                                    | 0.00/16.0 [00:00<?, ? Bytes/s][A
[1;1madd[0m([1;32mok[0m): sub-1_shape.txt ([1;35mfile[0m)                 [A
[1;1msave[0m([1;32mok[0m): outputs ([1;35mdataset[0m)                     
[1;1madd[0m([1;32mok[0m): outputs ([1;35mdataset[0m)                      
[1;1madd[0m([1;32mok[0m): .gitmodules ([1;35mfile[0m)                     
Total (2 skipped):  75%|██████████▌   | 3.00/4.00 [00:00<00:00, 44.8 datasets/s]
Total:   0%|                                     | 0.00/416 [00:00<?, ? Bytes/s][A
[1;1msave[0m([1;32mok[0m): . ([1;35mdataset[0m)                  

## Sharing datasets

- Two cases: Shared infrastructure (e.g. a lab server), and third-party infrastructure (e.g. OSF).
- Shared infrastructure is relatively easy, can datalad clone the path of the dataset.


In [24]:
!datalad clone datalad-tutorial datalad-tutorial-clone

[1;1minstall[0m([1;32mok[0m): /home/jovyan/datalad-tutorial-clone ([1;35mdataset[0m)
[0m

- Third-party infrastructure is possible but harder -- I won’t get into it for this tutorial.
- You can clone your dataset to GitHub, but your annexed files will not be accessible without further action:
- `datalad create-sibling-github -d . -r`
  - Note: This has some issues on the binder, but I’ll demo the results.
  - https://github.com/tkkuehn/demo-datalad-brainhack