### Insect custom parsing
- A labelled dataset of katydid, cricket, and cicada recordings, plus other insects (bee, beetle, fruit fly, midge, mosquito, wasp). Each file is supposed to come from a different species, although species IDs are not available.
    - .WAV files with species labels
- This notebook creates a JSON file for each WAV file (and for each noise file, where available).
- Dataset origin:
    - https://link.springer.com/chapter/10.1007/978-3-319-26561-2_42
    - https://link.springer.com/chapter/10.1007/978-1-4614-3501-3_24

In [1]:
from avgn.utils.general import prepare_env

In [2]:
prepare_env()

env: CUDA_VISIBLE_DEVICES=GPU


### Import relevant packages

In [3]:
from joblib import Parallel, delayed
from tqdm.autonotebook import tqdm
import pandas as pd
pd.options.display.max_columns = None
import librosa
from datetime import datetime
import numpy as np



In [4]:
import avgn
from avgn.custom_parsing.gonzalez_insects import generate_json
from avgn.utils.paths import DATA_DIR

### Load data in original format

In [5]:
# create a unique datetime identifier for the files output by this notebook
DT_ID = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
DT_ID

'2019-06-26_16-00-45'

In [6]:
DSLOC = avgn.utils.paths.Path('/mnt/cube/Datasets/insectORIG')
DSLOC

PosixPath('/mnt/cube/Datasets/insectORIG')

In [7]:
WAVLIST = list((DSLOC).expanduser().glob('*.wav'))
len(WAVLIST), WAVLIST[0]

(381, PosixPath('/mnt/cube/Datasets/insectORIG/CC_139CS.wav'))
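The description above mentions `.WAV` files, while the glob pattern `'*.wav'` is lowercase; on a case-sensitive filesystem that pattern would skip upper-case extensions. The following is a minimal sketch, not part of the original notebook, of a case-insensitive listing. The helper name `list_wavs` is hypothetical.

```python
from pathlib import Path


def list_wavs(folder):
    """Return sorted paths in `folder` whose suffix is .wav in any letter case.

    `list_wavs` is a hypothetical helper, not an avgn function.
    """
    folder = Path(folder).expanduser()
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".wav"
    )
```

If every file in the dataset folder already uses a lowercase extension, this returns the same set of paths as the glob above, only sorted.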

In [8]:
wav_df = pd.DataFrame(
    [[i, i.stem.split("_")[0], i.stem.split("_")[1]] for i in WAVLIST],
    columns=["wavloc", "species_group", "species"],
)
print(len(wav_df))
wav_df[:3]

381


| | wavloc | species_group | species |
|---|---|---|---|
| 0 | /mnt/cube/Datasets/insectORIG/CC_139CS.wav | CC | 139CS |
| 1 | /mnt/cube/Datasets/insectORIG/KA_himegisu.wav | KA | himegisu |
| 2 | /mnt/cube/Datasets/insectORIG/CR_479scdg.wav | CR | 479scdg |

### Create JSON for each animal

In [10]:
with Parallel(n_jobs=-1, verbose=10) as parallel:
    parallel(
        delayed(generate_json)(row, DT_ID)
        for idx, row in tqdm(wav_df.iterrows(), total = len(wav_df))
    )

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    3.7s
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.1926s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done  50 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done  65 tasks      | elapsed:    4.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0634s.) Setting batch_size=12.
[Parallel(n_jobs=-1)]: Done  88 tasks      | elapsed:    4.0s
[Parallel(n_jobs=-1)]: Done 295 out of 381 | elapsed:    4.2s remaining:    1.2s
[Parallel(n_jobs=-1)]: Done 373 out of 381 | elapsed:    4.2s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done 381 out of 381 | elapsed:    4.2s finished
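The notebook does not show what `generate_json` writes, so the following is only an illustrative sketch of a per-file JSON record, not avgn's implementation. The function name `write_wav_json`, the field names, and the output layout are all assumptions. It uses the standard-library `wave` module, which reads only uncompressed PCM WAV files.

```python
import json
import wave
from pathlib import Path


def write_wav_json(wavloc, species_group, species, out_dir):
    """Write a small JSON record describing one WAV file.

    Hypothetical stand-in for illustration; field names are assumptions
    and do not reflect the format produced by avgn's generate_json.
    """
    wavloc = Path(wavloc)
    # Read the sample rate and frame count from the WAV header.
    with wave.open(str(wavloc), "rb") as wf:
        rate = wf.getframerate()
        n_frames = wf.getnframes()
    record = {
        "species_group": species_group,
        "species": species,
        "wav_loc": str(wavloc),
        "samplerate_hz": rate,
        "length_s": n_frames / rate,
    }
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # One JSON file per WAV file, named after the WAV stem.
    json_loc = out_dir / (wavloc.stem + ".json")
    json_loc.write_text(json.dumps(record, indent=2))
    return json_loc
```

A function of this shape takes one row's worth of information and has no shared state, which is the property that lets the per-file work be dispatched in parallel as in the cell above.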
