## Downloading and preprocessing public IBL data

This notebook walks through downloading the public IBL data needed to fit the MSPS-VAE (raw mp4 videos and DLC traces) and then preprocesses that data into the format expected by the BehaveNet codebase.

Note that the raw videos are large (5-10 GB each) and will take some time to download.

First, create a new conda environment and install the `ibllib` package, which handles data downloading; see the instructions [here](https://int-brain-lab.github.io/iblenv/public_docs/public_one.html). Then activate the `ibllib` environment and install the `h5py` package through conda:
```
(ibllib) $: conda install h5py
```

Run this notebook with the `ibllib` kernel so that the required packages are available. The name of the current IPython kernel is shown in the upper right-hand corner of the notebook. If it is not `ibllib` (for example, it might be `Python 3`), change it from the menu: `Kernel > Change kernel > ibllib`. If `ibllib` is not listed as an option, activate the `ibllib` conda environment and then run the following command in a terminal:

```
(ibllib) $: python -m ipykernel install --user --name ibllib
```

Notes:
* Update the local paths `data_path_raw` and `data_path_proc` below before running the notebook.
* ffmpeg must be installed to run the final step, `test_hdf5_build`, which writes a test video.
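
Before running the cells below, it can help to confirm that these prerequisites are visible to the active kernel. The following is a minimal sketch, not part of the notebook's pipeline: `check_prerequisites` is a hypothetical helper that uses only the standard library to report whether the modules this notebook relies on can be found and whether `ffmpeg` is on the `PATH`.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(modules=('one', 'h5py', 'numpy'), executables=('ffmpeg',)):
    """Report which modules and executables are visible to this kernel.

    Hypothetical helper for illustration only; the default names are the
    ones mentioned in this notebook.
    """
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        report[name] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH
        report[exe] = shutil.which(exe) is not None
    return report


if __name__ == '__main__':
    # the interpreter path shows which conda environment the kernel uses
    print('python executable:', sys.executable)
    for name, found in check_prerequisites().items():
        print('%-8s %s' % (name, 'found' if found else 'MISSING'))
```

If anything is reported as missing, the kernel is probably not running inside the `ibllib` environment, or the package has not been installed there yet.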

In [None]:
import os
import sys

import numpy as np
from one.api import ONE

# make the local `ibl_utils` package importable
sys.path.append('.')
from ibl_utils.pipeline import PawProcessor

In [None]:
# ------------------------------------
# set user-defined paths
# ------------------------------------
# where raw ibl data is stored - UPDATE THIS TO YOUR LOCAL PATH
data_path_raw = '/media/mattw/data/TEST/raw_data/'
# where processed behavenet hdf5 is stored - UPDATE THIS TO YOUR LOCAL PATH
data_path_proc = '/media/mattw/data/TEST/data/'

# connect to server
one = ONE(
    base_url='https://openalyx.internationalbrainlab.org', 
    cache_dir=data_path_raw,
    password='international',
    silent=True)

In [None]:
# define public sessions used in the PS-VAE paper
# (only session 2 is enabled here; uncomment other entries to process them as well)
sessions = [
    # session 1
#     {'eid': '89f0d6ff-69f4-45bc-b89e-72868abb042a', 
#      'lab': 'churchlandlab',
#      'animal': 'CSHL047',
#      'date': '2020-01-20',
#      'number': '001'},
    # session 2
    {'eid': '4b7fbad4-f6de-43b4-9b15-c7c7ef44db4b', 
     'lab': 'churchlandlab',
     'animal': 'CSHL049',
     'date': '2020-01-08',
     'number': '001'},
    # session 3
#     {'eid': 'aad23144-0e52-4eac-80c5-c4ee2decb198', 
#      'lab': 'cortexlab',
#      'animal': 'KS023',
#      'date': '2019-12-10',
#      'number': '001'},
    # session 4
#     {'eid': '4ecb5d24-f5cc-402c-be28-9d0f7cb14b3a', 
#      'lab': 'hoferlab',
#      'animal': 'SWC_043',
#      'date': '2020-09-21',
#      'number': '001'},
]
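
When editing the session list by hand, a quick format check can catch typos before any large download starts. The sketch below is an illustration only: `validate_session` and `REQUIRED_KEYS` are hypothetical names, the key set mirrors the dictionaries above, and the three-digit `number` format is an assumption inferred from the entries shown.

```python
import uuid
from datetime import datetime

# keys used by the session dictionaries in this notebook
REQUIRED_KEYS = ('eid', 'lab', 'animal', 'date', 'number')


def validate_session(session):
    """Return a list of problems found in one session dict (empty if none).

    Hypothetical helper; assumes all values are strings.
    """
    problems = ['missing key: %s' % k for k in REQUIRED_KEYS if k not in session]
    if problems:
        return problems
    try:
        # eids in the list above are UUID strings
        uuid.UUID(session['eid'])
    except ValueError:
        problems.append('eid is not a valid UUID: %r' % session['eid'])
    try:
        datetime.strptime(session['date'], '%Y-%m-%d')
    except ValueError:
        problems.append('date is not YYYY-MM-DD: %r' % session['date'])
    # assumption: session numbers are zero-padded to three digits, e.g. '001'
    if not (len(session['number']) == 3 and session['number'].isdigit()):
        problems.append('number is not three digits: %r' % session['number'])
    return problems


example = {
    'eid': '4b7fbad4-f6de-43b4-9b15-c7c7ef44db4b',
    'lab': 'churchlandlab',
    'animal': 'CSHL049',
    'date': '2020-01-08',
    'number': '001',
}
print(validate_session(example))
```

An empty list means the entry passed every check; this says nothing about whether the session actually exists on the server.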

### Build hdf5

In [None]:
# ------------------------------------
# marker info
# ------------------------------------
# camera view to take frames from (paper uses left view)
view = 'left'
# likelihood threshold - markers with likelihoods below this threshold will be masked
l_thresh = 0.9

# ------------------------------------
# hdf5 info
# ------------------------------------
# True to overwrite existing hdf5 file
overwrite_hdf5 = True
# xpix of final downsampled frames
xpix = 192
# ypix of final downsampled frames
ypix = 192
# number of contiguous frames per batch
batch_size = 96
# total number of batches
n_batches = 150
# batch_selection options:
# 'me': batches with highest motion energy
# 'random': random batches
# None: use every time point in a batch
batch_selection = 'me'  

# loop over sessions
for session in sessions:

    # initialize class that handles video pipeline
    vp = PawProcessor(one, view=view, **session)
    print(vp)

    # compute paths
    vp.compute_paths(data_path_raw=data_path_raw)
    vp.paths.data_path_proc = os.path.join(
        data_path_proc, 'single-view', vp.lab, vp.animal, vp.session)
    vp.paths.hdf5_file = os.path.join(vp.paths.data_path_proc, 'data.hdf5')
    
    # determine if we need to run pipeline
    if os.path.exists(vp.paths.hdf5_file) and not overwrite_hdf5:
        print('data.hdf5 file already exists at %s; skipping\n\n' % vp.paths.hdf5_file)
        continue
        
    # download data from public server; will skip if data is already present
    vp.download_data()

    # load markers
    vp.load_2d_markers(likelihood_thresh=l_thresh)

    # load cv video capture objects
    vp.load_video_cap()
    
    # find crop params to align videos to anatomical features
    vp.find_crop_params()
            
    # update likelihoods to catch nan frames
    for m in ['paw_l', 'paw_r']:
        idxs_tmp = np.isnan(vp.markers.vals[m])[:, 0]
        vp.markers.likelihoods[m][idxs_tmp] = 0
        vp.markers.masks[m][idxs_tmp] = 0
            
    # update markers to reflect rescaling of view
    if view == 'left':
        # downsample markers from left view by a factor of 2
        for m in vp.markers.vals.keys():
            vp.markers.vals[m] /= 2

    # build hdf5 file
    print('constructing hdf5 file at %s' % vp.paths.hdf5_file)
    vp.build_hdf5(
        hdf5_file=vp.paths.hdf5_file, batch_size=batch_size, xpix=xpix, ypix=ypix,
        n_batches=n_batches, batch_selection=batch_selection)

    # output test video as a sanity check
    print('testing hdf5 file at %s' % vp.paths.hdf5_file)
    save_file = os.path.join(vp.paths.data_path_proc, 'test-batch_hdf5.mp4')
    idxs = [0, 49, 99, 149]
    data_dict = vp.test_hdf5_build(vp.paths.hdf5_file, idxs=idxs, save_file=save_file)
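
The marker handling in the loop above (likelihood thresholding, zeroing out nan frames, and halving coordinates for the left view) can be illustrated on toy arrays. This is a minimal sketch under stated assumptions, not the `PawProcessor` implementation: `mask_markers` is a hypothetical function, and the array shapes are chosen for illustration.

```python
import numpy as np


def mask_markers(vals, likelihoods, thresh=0.9):
    """Toy likelihood masking for a single marker.

    vals: (n_frames, 2) array of x/y coordinates (assumed shape)
    likelihoods: (n_frames,) tracker confidences
    Returns (likelihoods, masks); masks are 1 for usable frames, 0 otherwise.
    """
    likelihoods = np.array(likelihoods, dtype=float)  # work on a copy
    # frames whose x coordinate is nan are treated as untracked
    nan_frames = np.isnan(vals)[:, 0]
    likelihoods[nan_frames] = 0
    # markers with likelihood below the threshold are masked
    masks = (likelihoods >= thresh).astype(int)
    return likelihoods, masks


vals = np.array([[10.0, 12.0], [np.nan, np.nan], [11.0, 13.0], [11.5, 13.5]])
liks = [0.99, 0.95, 0.5, 0.9]
new_liks, masks = mask_markers(vals, liks, thresh=0.9)
print(masks)

# when frames are downsampled by a factor of 2, coordinates shrink with them
vals_ds = vals / 2
print(vals_ds[0])
```

In this example the second frame is masked because its coordinates are nan, even though its reported likelihood was high, and the third frame is masked because its likelihood falls below the threshold.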