### Mouse USV custom parsing
- This dataset contains a number of WAV files and corresponding respiratory data that is not properly aligned to the audio. The respiratory data is aligned in a second dataset by the same author, though. I only downloaded a subset of the large dataset here. The WAVs contain mostly continuous vocalizations (although they are a little noisy), so I'm not segmenting the vocalization files any further.
    - Data used here: .WAV files of vocalizations, with the individual's ID encoded in the filename
- This notebook creates a JSON file corresponding to each WAV file.
- Dataset origin:
    - https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0199929
    - https://datadryad.org/handle/10255/dryad.177144

In [33]:
from avgn.utils.general import prepare_env

In [34]:
prepare_env()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
env: CUDA_VISIBLE_DEVICES=GPU


### Import relevant packages

In [35]:
from joblib import Parallel, delayed
from tqdm.autonotebook import tqdm
import pandas as pd
pd.options.display.max_columns = None
import librosa
from datetime import datetime
import numpy as np

In [66]:
import avgn
from avgn.custom_parsing.castellucci_mouse_usv import generate_json
from avgn.utils.paths import DATA_DIR

### Load data in original format

In [37]:
DATASET_ID = 'castellucci_mouse_usv'

In [38]:
# create a unique datetime identifier for the files output by this notebook
DT_ID = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
DT_ID

'2019-06-23_22-11-52'

In [39]:
DSLOC = avgn.utils.paths.Path('/mnt/cube/Datasets/mouse_usv')
DSLOC

PosixPath('/mnt/cube/Datasets/mouse_usv')

### Parse WAV info

In [40]:
wav_files = list(DSLOC.expanduser().glob('*/*.WAV'))
wav_files[:3]

[PosixPath('/mnt/cube/Datasets/mouse_usv/VOC591/VOC591_Isolation_Call_CMPA_8_9_2016_14_6.04.WAV'),
 PosixPath('/mnt/cube/Datasets/mouse_usv/VOC591/VOC591_VOC571_SONG_CMPA_10_7_2016_73_22.97.WAV'),
 PosixPath('/mnt/cube/Datasets/mouse_usv/VOC591/VOC591_VOC583_SONG_CMPA_9_3_2016_39_19.17.WAV')]

In [41]:
wf_df = pd.DataFrame(
    columns=[
        "indv",
        "MaleMouse",
        "FemaleMouse",
        "SONG",
        "CMPA",
        "month",
        "day",
        "year",
        "AGE",
        "Weight",
        "wav_loc",
    ]
)
for wf in tqdm(wav_files):
    # keep only files whose names split into the 9 expected metadata fields
    fields = wf.stem.split("_")
    if len(fields) != 9:
        continue
    wf_df.loc[len(wf_df)] = [wf.parent.stem] + fields + [wf]

In [42]:
len(wf_df)

143
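
Files whose stems do not split into exactly nine underscore-separated fields are skipped silently by the loop above, so it can be worth checking how the field counts are distributed before trusting the filter. The following is a minimal sketch of that check, not part of the original notebook. The helper `count_fields` and all but the first example path are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_fields(paths, sep="_"):
    """Tally how many separator-delimited fields each file stem contains."""
    return Counter(len(Path(p).stem.split(sep)) for p in paths)


# The first path follows the 9-field convention seen in this dataset;
# the other two are hypothetical names that the filter would skip.
example_paths = [
    "VOC591/VOC591_VOC571_SONG_CMPA_10_7_2016_73_22.97.WAV",
    "VOC000/VOC000_extra_field_SONG_CMPA_1_1_2016_10_5.00.WAV",
    "VOC000/short_name.WAV",
]

field_counts = count_fields(example_paths)
for n_fields, n_files in sorted(field_counts.items()):
    print(f"{n_fields} fields: {n_files} file(s)")
```

Running `count_fields(wav_files)` on the real file list would show which field counts account for the skipped files.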

In [43]:
wf_df[:3]

| | indv | MaleMouse | FemaleMouse | SONG | CMPA | month | day | year | AGE | Weight | wav_loc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | VOC591 | VOC591 | Isolation | Call | CMPA | 8 | 9 | 2016 | 14 | 6.04 | /mnt/cube/Datasets/mouse_usv/VOC591/VOC591_Iso... |
| 1 | VOC591 | VOC591 | VOC571 | SONG | CMPA | 10 | 7 | 2016 | 73 | 22.97 | /mnt/cube/Datasets/mouse_usv/VOC591/VOC591_VOC... |
| 2 | VOC591 | VOC591 | VOC583 | SONG | CMPA | 9 | 3 | 2016 | 39 | 19.17 | /mnt/cube/Datasets/mouse_usv/VOC591/VOC591_VOC... |
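
All columns taken from the filename are stored as strings in the dataframe. The sketch below shows one way the same naming convention could be parsed into typed values. It is an illustration rather than the notebook's own method: the function `parse_wav_name` is hypothetical, and it assumes that `month`, `day`, `year` and `AGE` are always integers and `Weight` is always a float, which the three rows shown suggest but do not guarantee.

```python
from pathlib import Path

# Field names in the order they appear in the filename (same as the dataframe columns)
FIELDS = ["MaleMouse", "FemaleMouse", "SONG", "CMPA",
          "month", "day", "year", "AGE", "Weight"]


def parse_wav_name(wav_path):
    """Parse a WAV path into a metadata dict, or return None if the
    stem does not contain exactly len(FIELDS) underscore-separated parts."""
    wav_path = Path(wav_path)
    parts = wav_path.stem.split("_")
    if len(parts) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    record["indv"] = wav_path.parent.name  # folder name is the individual's ID
    # Assumption: these fields are numeric in every filename
    for key in ("month", "day", "year", "AGE"):
        record[key] = int(record[key])
    record["Weight"] = float(record["Weight"])
    record["wav_loc"] = wav_path
    return record


rec = parse_wav_name(
    "/mnt/cube/Datasets/mouse_usv/VOC591/"
    "VOC591_VOC571_SONG_CMPA_10_7_2016_73_22.97.WAV"
)
print(rec["indv"], rec["AGE"], rec["Weight"])
```

A filename with a non-numeric value in one of those fields would raise a `ValueError` here, which makes a violated assumption visible instead of passing it along unnoticed.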


In [57]:
wf_df.CMPA.unique()

array(['CMPA'], dtype=object)

### Create JSON for files

In [68]:
with Parallel(n_jobs=-1, verbose=10) as parallel:
    parallel(
        delayed(generate_json)(
            row, DT_ID
        )
        for idx, row in tqdm(wf_df.iterrows(), total=len(wf_df))
    )


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    6.0s
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed:    6.5s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    6.6s
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:    8.1s
[Parallel(n_jobs=-1)]: Done  50 tasks      | elapsed:    8.8s
[Parallel(n_jobs=-1)]: Done  65 tasks      | elapsed:    9.7s
[Parallel(n_jobs=-1)]: Done  80 tasks      | elapsed:   10.5s





[Parallel(n_jobs=-1)]: Done 111 out of 143 | elapsed:   12.3s remaining:    3.6s
[Parallel(n_jobs=-1)]: Done 126 out of 143 | elapsed:   13.0s remaining:    1.7s
[Parallel(n_jobs=-1)]: Done 141 out of 143 | elapsed:   13.0s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done 143 out of 143 | elapsed:   13.0s finished
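
The body of `generate_json` is not shown in this notebook. To give a rough idea of the "one JSON per WAV" pattern, here is a minimal standard-library sketch. The function `write_metadata_json`, the output naming and the JSON layout are all assumptions for illustration and may differ from what `avgn.custom_parsing.castellucci_mouse_usv.generate_json` actually writes.

```python
import json
import tempfile
from pathlib import Path


def write_metadata_json(record, out_dir):
    """Write one JSON file named after the WAV stem, holding the row's metadata.

    Hypothetical helper: the layout is illustrative, not avgn's schema.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / (Path(record["wav_loc"]).stem + ".JSON")
    # Path objects are not JSON-serializable, so convert them to strings
    payload = {
        key: str(value) if isinstance(value, Path) else value
        for key, value in record.items()
    }
    with open(json_path, "w") as f:
        json.dump(payload, f, indent=2)
    return json_path


# Example row built from the second file listed above
example_record = {
    "indv": "VOC591",
    "AGE": "73",
    "Weight": "22.97",
    "wav_loc": Path(
        "/mnt/cube/Datasets/mouse_usv/VOC591/"
        "VOC591_VOC571_SONG_CMPA_10_7_2016_73_22.97.WAV"
    ),
}

with tempfile.TemporaryDirectory() as tmp:
    json_path = write_metadata_json(example_record, tmp)
    loaded = json.loads(json_path.read_text())

print(json_path.name)
print(loaded["indv"], loaded["wav_loc"])
```

Because each file is written independently, a function of this shape can be dispatched with `joblib.Parallel` and `delayed`, as in the cell above.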
