# [git-annex](https://git-annex.branchable.com/) and [Datalad](http://datalad.org/)

### Git
You probably already know Git:
- it is a **distributed** version control system
- GitHub and Bitbucket are web-based hosting services for version control using Git
- it is very efficient for managing textual files (code, text, configuration, etc.)
- it is **inefficient for storing data**
    - it’s not designed to handle big files
    - you don’t want to upload your big files to GitHub and keep a copy of them in every single location

### git-annex
- built on top of Git
- allows managing files with git without checking the file contents into git (it commits only symbolic links to the files)
- adds "special remotes" to provide access to data content from various sources: web, Amazon S3, Dropbox, Google Drive, etc.
- both Git and git-annex largely work on a single repository level

### Datalad
- relies on Git and git-annex
- **manages multiple repositories organized into “super-datasets”**
- can crawl external online data sources, and update git/annex repositories upon changes
- is scalable since data stays with original data providers
- unifies access to data regardless of its origin (custom portals with authentication, S3, etc.) or serialization (e.g., tarballs)
- **aggregates datasets’ meta-data and allows for quick search**
- can publish original or derived datasets publicly (a web server, github) or for internal use (e.g., via ssh)
- **comes with command line and Python interfaces**

**In your scientific work Datalad can be used to:**
- **Discover Data**: built-in support for metadata extraction and search
- **Consume Data**: direct access to individual files (great when you only need a few files from a large dataset)
- **Publish Data**: support for sharing datasets with the public or your lab on platforms that you are using already 
- **Reproducibility**: support for joint management of analysis code and data


**Example of Data Consumption using Python API**:

You can install `git-annex` and `Datalad` locally as described in [Datalad GitHub](https://github.com/datalad/datalad#other-linuxes-osx-windows-yet-todo-via-pip).  
You can also use the Docker image described in the Docker notebook (you can pull it from DockerHub: `docker pull djarecka/repropython`).

In [1]:
# First we will install a publicly available dataset
from datalad.api import install
ds = install('///openfmri/ds000001')

[INFO] Cloning http://datasets.datalad.org/openfmri/ds000001 to '/Users/dorota/trainings/pycon2018/ReproduciblePython/ds000001' 
[INFO] access to dataset sibling "datalad" not auto-enabled, enable with:
| 		datalad siblings -d "/Users/dorota/trainings/pycon2018/ReproduciblePython/ds000001" enable -s datalad 
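
The log above shows that the `///` prefix in the source was resolved to `http://datasets.datalad.org/`. As a rough illustration only, that mapping can be mimicked with a few lines of Python. `expand_shorthand` is a hypothetical name, not part of the Datalad API, and the base URL is simply copied from the `[INFO]` line above.

```python
# Hypothetical helper, NOT part of Datalad: it only mimics the '///'
# resolution that is visible in the [INFO] log line above.
DEFAULT_BASE = "http://datasets.datalad.org/"  # assumption: taken from the log


def expand_shorthand(source, base=DEFAULT_BASE):
    """Return a full URL for a '///'-prefixed source; leave other sources unchanged."""
    if source.startswith("///"):
        return base + source[3:]
    return source


print(expand_shorthand("///openfmri/ds000001"))
# http://datasets.datalad.org/openfmri/ds000001
```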


In [5]:
# As you can see, datalad cloned the dataset to a local directory
# (you might have some warnings if your git is not configured)

# We can check the content of the new directory
!ls -l ds000001

total 40
-rw-r--r--  1 dorota  staff   578 May  7 23:26 CHANGES
-rw-r--r--  1 dorota  staff  2141 May  7 23:26 README
-rw-r--r--  1 dorota  staff   995 May  7 23:26 dataset_description.json
-rw-r--r--  1 dorota  staff   215 May  7 23:26 participants.tsv
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-01
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-02
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-03
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-04
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-05
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-06
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-07
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-08
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-09
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-10
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-11
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-12
drwxr-xr-x  4 dorota  staff   136 May  7 23:26 sub-13
drwxr-xr-x  4 dorota  staff   136 May  7 23:

In [3]:
# we can check files in one of the subdirectories
!ls -l ds000001/sub-10/anat

total 8
lrwxr-xr-x 1 repropython users 138 May  7 22:09 sub-10_inplaneT2.nii.gz -> ../../.git/annex/objects/FJ/GK/MD5E-s684908--ce1ce3e4fb6be073bf5e03280ea14acb.nii.gz/MD5E-s684908--ce1ce3e4fb6be073bf5e03280ea14acb.nii.gz
lrwxr-xr-x 1 repropython users 140 May  7 22:09 sub-10_T1w.nii.gz -> ../../.git/annex/objects/Qk/mv/MD5E-s5481518--675ca482ce805a7dd359e52a576a9206.nii.gz/MD5E-s5481518--675ca482ce805a7dd359e52a576a9206.nii.gz
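
The symlink targets listed above end in a git-annex key such as `MD5E-s684908--ce1c…acb.nii.gz`, which encodes the hashing backend, the file size in bytes, and a checksum. The sketch below pulls those fields out of a link target. It assumes the simple `BACKEND-s<size>--<name>` layout seen in this listing and does not handle other optional key fields; `parse_annex_key` is a name made up for this example.

```python
import posixpath
import re

# Assumed key layout: BACKEND-s<size in bytes>--<name>.
# Other optional fields that a key may carry are not handled here.
KEY_PATTERN = re.compile(r"^(?P<backend>[A-Z0-9_]+)-s(?P<size>\d+)--(?P<name>.+)$")


def parse_annex_key(link_target):
    """Split the last path component of a symlink target into key fields."""
    key = posixpath.basename(link_target)
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unrecognised annex key: {key!r}")
    # For the MD5E backend the name is the digest followed by the file extension.
    digest, dot, extension = match.group("name").partition(".")
    return {
        "backend": match.group("backend"),
        "size": int(match.group("size")),
        "digest": digest,
        "extension": dot + extension,
    }


target = (
    "../../.git/annex/objects/FJ/GK/"
    "MD5E-s684908--ce1ce3e4fb6be073bf5e03280ea14acb.nii.gz/"
    "MD5E-s684908--ce1ce3e4fb6be073bf5e03280ea14acb.nii.gz"
)
info = parse_annex_key(target)
print(info["backend"], info["size"], info["extension"])
# MD5E 684908 .nii.gz
```

This is why the size of a file is known even before its content has been downloaded: it is part of the link target itself.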


In [4]:
# so we have a list of symbolic links that point to the files, but the content was not downloaded

# you can check the size of the files
!ls -lL ds000001/sub-10/anat/*

lrwxr-xr-x  1 dorota  staff  140 May  7 23:26 ds000001/sub-10/anat/sub-10_T1w.nii.gz -> ../../.git/annex/objects/Qk/mv/MD5E-s5481518--675ca482ce805a7dd359e52a576a9206.nii.gz/MD5E-s5481518--675ca482ce805a7dd359e52a576a9206.nii.gz
lrwxr-xr-x  1 dorota  staff  138 May  7 23:26 ds000001/sub-10/anat/sub-10_inplaneT2.nii.gz -> ../../.git/annex/objects/FJ/GK/MD5E-s684908--ce1ce3e4fb6be073bf5e03280ea14acb.nii.gz/MD5E-s684908--ce1ce3e4fb6be073bf5e03280ea14acb.nii.gz
ls: sub-10_T1w.nii.gz: No such file or directory
ls: sub-10_inplaneT2.nii.gz: No such file or directory
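
The `No such file or directory` messages above appear because the symlinks point to annex objects that are not yet present locally. The same check can be done from Python with the standard library alone. This is a minimal sketch; `is_dangling_link` is a name chosen for this example, and the demo uses a throwaway temporary directory instead of the real dataset.

```python
import os
import tempfile


def is_dangling_link(path):
    """True if path is a symlink whose target does not exist."""
    # os.path.exists follows symlinks, so it is False for a dangling link.
    return os.path.islink(path) and not os.path.exists(path)


# Demo in a temporary directory (the file names are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "annex-object")
    link = os.path.join(tmp, "sub-10_T1w.nii.gz")
    os.symlink(target, link)

    print(is_dangling_link(link))  # content missing -> True

    with open(target, "wb") as fh:
        fh.write(b"data")
    print(is_dangling_link(link))  # content present -> False
```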


In [6]:
# you should see the message that the files are not present

# let's now download the files for subject 10 using the get method
ds.get('sub-10/anat')

[INFO] Actually getting 2 files 


Total:   0%|          | 0.00/6.17M [00:00<?, ?B/s]
Total:   4%|▎         | 225k/6.17M [00:01<00:32, 186kB/s]
Total:  11%|█         | 684k/6.17M [00:01<00:22, 244kB/s]
Total:  16%|█▋        | 1.02M/6.17M [00:05<00:31, 165kB/s]
Total:  24%|██▍       | 1.50M/6.17M [00:05<00:21, 220kB/s]
Total:  29%|██▉       | 1.81M/6.17M [00:06<00:18, 241kB/s]
Total:  37%|███▋      | 2.27M/6.17M [00:09<00:17, 221kB/s]
sub-10/anat .. _T1w.nii.gz:  41%|████▏     | 2.27M/5.48M [00:08<00:12, 250kB/s]
Total:  50%|████▉     | 3.06M/6.17M [00:10<00:11, 267kB/s]
Total:  66%|██████▋   | 4.10M/6.17M [00:11<00:05, 361kB/s]
Total:  85%|████████▌ | 5.26M/6.17M [00:11<00:01, 483kB/s]
Total (1 ok out of 2):  89%|████████▉ | 5.48M/6.17M [00:11<00:01, 483kB/s]
sub-10/anat .. neT2.nii.gz:   0%|          | 0.00/685k [00:00<?, ?B/s]
Total (1 ok out of 2): 100%|██████████| 6.17M/6.17M [00:12<00:00, 541kB/s]
Total (2 ok out of 2): 100%|██████████| 6.17M/6.17M [00:13<00:00, 541kB/s]


[{'type': 'file',
  'refds': '/Users/dorota/trainings/pycon2018/ReproduciblePython/ds000001',
  'status': 'ok',
  'path': '/Users/dorota/trainings/pycon2018/ReproduciblePython/ds000001/sub-10/anat/sub-10_T1w.nii.gz',
  'action': 'get',
  'annexkey': 'MD5E-s5481518--675ca482ce805a7dd359e52a576a9206.nii.gz'},
 {'type': 'file',
  'refds': '/Users/dorota/trainings/pycon2018/ReproduciblePython/ds000001',
  'status': 'ok',
  'path': '/Users/dorota/trainings/pycon2018/ReproduciblePython/ds000001/sub-10/anat/sub-10_inplaneT2.nii.gz',
  'action': 'get',
  'annexkey': 'MD5E-s684908--ce1ce3e4fb6be073bf5e03280ea14acb.nii.gz'},
 {'action': 'get',
  'path': '/Users/dorota/trainings/pycon2018/ReproduciblePython/ds000001/sub-10/anat',
  'type': 'directory',
  'refds': '/Users/dorota/trainings/pycon2018/ReproduciblePython/ds000001',
  'status': 'ok'}]
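
The list returned above holds one dictionary per processed path, each with `type`, `status`, `path`, and `action` keys. A small helper can condense such a list into a count of successfully retrieved files plus the paths that failed. This sketch only assumes the keys visible in the output above; `summarize` is a made-up name, and the example records use shortened paths.

```python
def summarize(results):
    """Return (number of files with status 'ok', list of file paths that failed)."""
    files = [r for r in results if r.get("type") == "file"]
    ok_count = sum(1 for r in files if r.get("status") == "ok")
    failed = [r["path"] for r in files if r.get("status") != "ok"]
    return ok_count, failed


# Records shaped like the output above (paths shortened for readability).
results = [
    {"type": "file", "status": "ok", "action": "get",
     "path": "ds000001/sub-10/anat/sub-10_T1w.nii.gz"},
    {"type": "file", "status": "ok", "action": "get",
     "path": "ds000001/sub-10/anat/sub-10_inplaneT2.nii.gz"},
    {"type": "directory", "status": "ok", "action": "get",
     "path": "ds000001/sub-10/anat"},
]

print(summarize(results))
# (2, [])
```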

In [7]:
# let's check the size of the files again
!ls -lL ds000001/sub-10/anat/*

-r--r--r--  1 dorota  staff  5481518 May  7 23:31 ds000001/sub-10/anat/sub-10_T1w.nii.gz
-r--r--r--  1 dorota  staff   684908 May  7 23:31 ds000001/sub-10/anat/sub-10_inplaneT2.nii.gz
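
The sizes listed above (5481518 and 684908 bytes) match the `s…` field of the annex keys shown earlier. As a sketch of how a downloaded file could be checked against its key, the function below compares both the size and the MD5 digest. It assumes the `MD5E-s<size>--<md5><extension>` layout seen in this notebook; `matches_md5e_key` is a hypothetical helper, not a git-annex or Datalad function, and the demo runs on a small temporary file instead of the real images.

```python
import hashlib
import os
import tempfile


def matches_md5e_key(path, key):
    """Check a file's size and MD5 digest against an 'MD5E-s<size>--<md5><ext>' key."""
    fields, _, name = key.partition("--")
    backend, _, size_field = fields.partition("-s")
    if backend != "MD5E" or not size_field.isdigit():
        raise ValueError(f"not a plain MD5E key: {key!r}")
    expected_digest = name.split(".", 1)[0]

    if os.path.getsize(path) != int(size_field):
        return False

    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large images do not have to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_digest


# Demo on a temporary file; the key is built from the file's own content.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    payload = b"hello"
    with open(path, "wb") as fh:
        fh.write(payload)
    key = f"MD5E-s{len(payload)}--{hashlib.md5(payload).hexdigest()}.txt"
    print(matches_md5e_key(path, key))
```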


In [9]:
# now the content of the files is in your local directory
# but you didn't have to download the content of the entire repository to get only these two files!! :)

More examples and videos can be found [here](http://datalad.org/features.html)

In [10]:
from IPython.core.display import HTML


def css_styling():
    # read the notebook's custom stylesheet, closing the file when done
    with open("styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()