### North American birds dataset custom parsing
- This dataset consists of WAV files, each segmented to roughly one syllable (although some vocalizations are not strictly syllables). It holds around 3000 vocalizations. The dataset contains:
    - .WAV files of vocalizations, whose paths encode a numeric ID and the species
- This notebook creates one JSON file for each WAV file.
- Dataset origin:
    - https://zenodo.org/record/1250690#.XQAO_G9KjUI
    - https://www.sciencedirect.com/science/article/pii/S157495411630231X
    - https://ieeexplore.ieee.org/document/8462156

In [1]:
from avgn.utils.general import prepare_env

In [2]:
prepare_env()

env: CUDA_VISIBLE_DEVICES=GPU


### Import relevant packages

In [3]:
from joblib import Parallel, delayed
from tqdm.autonotebook import tqdm
import pandas as pd
pd.options.display.max_columns = None
import librosa
from datetime import datetime
import numpy as np



In [4]:
import avgn
from avgn.custom_parsing.north_america_birds import generate_json
from avgn.utils.paths import DATA_DIR

### Load data in original format

In [5]:
DATASET_ID = 'NA-Birds'

In [6]:
# create a unique datetime identifier for the files output by this notebook
DT_ID = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
DT_ID

'2019-06-23_22-56-05'

In [7]:
DSLOC = avgn.utils.paths.Path('/mnt/cube/Datasets/NABirdSpecies/North American bird species/')
DSLOC

PosixPath('/mnt/cube/Datasets/NABirdSpecies/North American bird species')

In [8]:
wav_list = list(DSLOC.expanduser().glob('*/*.wav'))
wav_list[:3], len(wav_list)

([PosixPath('/mnt/cube/Datasets/NABirdSpecies/North American bird species/S3(Great Blue Heron)/s (19).wav'),
  PosixPath('/mnt/cube/Datasets/NABirdSpecies/North American bird species/S3(Great Blue Heron)/s (52).wav'),
  PosixPath('/mnt/cube/Datasets/NABirdSpecies/North American bird species/S3(Great Blue Heron)/s (242).wav')],
 3101)
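The paths listed above carry the species in the folder name (`S3(Great Blue Heron)`) and a clip number in the file name (`s (19).wav`). As a minimal sketch of how that layout could be parsed with explicit validation, the snippet below uses two regular expressions. `FOLDER_RE`, `FILE_RE`, and `parse_wav_path` are hypothetical names introduced for illustration and are not part of `avgn`; the patterns are inferred only from the three example paths shown.

```python
import re
from pathlib import PurePosixPath

# Hypothetical patterns inferred from the example paths; not part of avgn.
FOLDER_RE = re.compile(r"^(?P<label>[^(]+)\((?P<species>.+)\)$")
FILE_RE = re.compile(r"\((?P<num>\d+)\)$")


def parse_wav_path(path):
    """Return (species, clip number) for a path like '.../S3(Great Blue Heron)/s (19).wav'."""
    path = PurePosixPath(path)
    folder = FOLDER_RE.match(path.parent.name)
    clip = FILE_RE.search(path.stem)
    if folder is None or clip is None:
        # Fail loudly instead of silently producing a wrong label.
        raise ValueError(f"unexpected path layout: {path}")
    return folder.group("species"), int(clip.group("num"))


example = (
    "/mnt/cube/Datasets/NABirdSpecies/North American bird species/"
    "S3(Great Blue Heron)/s (19).wav"
)
print(parse_wav_path(example))  # ('Great Blue Heron', 19)
```

Raising on an unexpected layout is a design choice for this sketch: a file that does not follow the naming scheme is reported rather than mislabeled.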

In [9]:
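# species comes from the folder name, e.g. 'S3(Great Blue Heron)' -> 'Great Blue Heron';
# wavnum comes from the file name, e.g. 's (19)' -> 19
rows = []
for wf in tqdm(wav_list):
    wavnum = int(wf.stem.split('(')[1][:-1])
    species = wf.parent.stem.split('(')[1][:-1]
    rows.append([species, wf, wavnum])
# build the frame once instead of appending row by row
wav_df = pd.DataFrame(rows, columns=['species', 'wavloc', 'wavnum'])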

In [10]:
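# drop the "unknown" events class (substring match avoids the encoding-garbled quotes in the folder name)
wav_df = wav_df[~wav_df.species.str.contains('unknown')]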

In [11]:
print(len(wav_df))
wav_df[:3]

2762


| | species | wavloc | wavnum |
|---|---|---|---|
| 0 | Great Blue Heron | /mnt/cube/Datasets/NABirdSpecies/North America... | 19 |
| 1 | Great Blue Heron | /mnt/cube/Datasets/NABirdSpecies/North America... | 52 |
| 2 | Great Blue Heron | /mnt/cube/Datasets/NABirdSpecies/North America... | 242 |
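The filtering step above removes the catch-all "unknown" events class, which accounts for 3101 - 2762 = 339 files. As a minimal, standard-library sketch of that step, the snippet below filters toy records and tallies the remaining species. The records, `is_unknown`, and the substring-matching rule are assumptions made for illustration; they are not taken from `avgn` or from the dataset itself.

```python
from collections import Counter

# Toy records standing in for rows of wav_df (values are illustrative only).
records = [
    {"species": "Great Blue Heron", "wavnum": 19},
    {"species": "Great Blue Heron", "wavnum": 52},
    {"species": "\u201cunknown\u201d events", "wavnum": 7},
]


def is_unknown(species):
    """True for the catch-all class, however its quotation marks were decoded."""
    return "unknown" in species.lower()


kept = [r for r in records if not is_unknown(r["species"])]
counts = Counter(r["species"] for r in kept)

print(len(records) - len(kept), "dropped")
print(dict(counts))
```

Matching on the word rather than the full label keeps the filter independent of how the surrounding quotation marks happen to be encoded in the folder name.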


### Generate JSON for files

In [12]:
with Parallel(n_jobs=-1, verbose=10) as parallel:
    parallel(
        delayed(generate_json)(
            row, DT_ID
        )
        for idx, row in tqdm(
            wav_df.iterrows(),
            total=len(wav_df),
        )
    )

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    3.7s
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.1883s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done  50 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done  65 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0550s.) Setting batch_size=14.
[Parallel(n_jobs=-1)]: Done  86 tasks      | elapsed:    4.0s
[Parallel(n_jobs=-1)]: Done 119 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.1667s.) Setting batch_size=32.
[Parallel(n_jobs=-1)]: Done 183 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done 358 tasks      | elapsed:    4.4s
[Parallel(n_jobs=-1)]: Done 572 tasks      | elapsed




[Parallel(n_jobs=-1)]: Done 1118 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done 2762 out of 2762 | elapsed:    5.0s finished
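The body of `generate_json` is not shown in this notebook. As a rough sketch of what a per-file JSON writer of this kind might do, the function below reads basic audio properties and writes one JSON file per WAV file. Everything here is an assumption made for illustration: the name `sketch_generate_json`, the JSON field names, and the output layout under `out_root` are hypothetical, and the standard-library `wave` module stands in for `librosa` so the example has no third-party dependencies.

```python
import json
import wave
from pathlib import Path


def sketch_generate_json(species, wav_loc, wav_num, dt_id, out_root):
    """Write one JSON metadata file for one WAV file and return its path.

    Hypothetical illustration; not the avgn implementation.
    """
    wav_loc = Path(wav_loc)
    # Read the sample rate and frame count from the WAV header.
    with wave.open(str(wav_loc), "rb") as wf:
        samplerate = wf.getframerate()
        n_frames = wf.getnframes()

    record = {
        "species": species,
        "wav_loc": wav_loc.as_posix(),
        "wav_num": wav_num,
        "samplerate_hz": samplerate,
        "length_s": n_frames / samplerate,
    }

    # Group the output by the run's datetime identifier (assumed layout).
    out_dir = Path(out_root) / dt_id / "JSON"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{species}_{wav_num}.json"
    out_path.write_text(json.dumps(record, indent=2))
    return out_path
```

A function with this shape takes only plain, picklable arguments and writes to its own output file, so independent calls could be dispatched through `joblib.Parallel` in the same way as the cell above.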
