### Chiffchaff, little owl, tree pipit custom parsing
- The dataset contains:
    - A set of CSVs that give the individual ID for each WAV file
    - WAV files of vocalizations
    - WAV files containing only background noise for each vocalization
- This notebook creates a JSON file for each WAV file (and its noise file, where available).
- Dataset origin:
    - https://zenodo.org/record/1413495#.XQ0UM29KjUK
    - https://royalsocietypublishing.org/doi/full/10.1098/rsif.2018.0940
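
To make the per-file JSON described above more concrete, here is a minimal sketch of what one record could look like. The helper `build_record` and every field name in it (`species`, `indv`, `wav_loc`, `noise_loc`) are assumptions made for illustration; they are not taken from avgn's `generate_json`, whose actual schema is not shown in this notebook.

```python
import json
from pathlib import Path


def build_record(wav_path, species, indv, noise_path=None):
    """Build a JSON-serializable description of one vocalization WAV.

    All field names here are hypothetical, not avgn's actual schema.
    """
    record = {
        "species": species,
        "indv": indv,
        "wav_loc": Path(wav_path).as_posix(),
    }
    # The noise entry is optional: it is only added when a noise file exists.
    if noise_path is not None:
        record["noise_loc"] = Path(noise_path).as_posix()
    return record


# Made-up paths and IDs, one record with a noise file and one without.
with_noise = build_record("wav/fg/a.wav", "pipit", "indv01", noise_path="wav/bg/a.wav")
without_noise = build_record("wav/fg/b.wav", "pipit", "indv02")
print(json.dumps(with_noise, indent=2))
```

Keeping the noise entry optional means a vocalization without a background recording still gets a valid record.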

In [1]:
from avgn.utils.general import prepare_env

In [2]:
prepare_env()

env: CUDA_VISIBLE_DEVICES=GPU


### Import relevant packages

In [3]:
from joblib import Parallel, delayed
from tqdm.autonotebook import tqdm
import pandas as pd
pd.options.display.max_columns = None
import librosa
from datetime import datetime
import numpy as np



In [4]:
import avgn
from avgn.custom_parsing.stowell_birds import parse_csv, generate_json
from avgn.utils.paths import DATA_DIR

### Load data in original format

In [5]:
# create a unique datetime identifier for the files output by this notebook
DT_ID = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
DT_ID

'2019-06-23_11-01-16'

In [6]:
DSLOC = avgn.utils.paths.Path('/mnt/cube/Datasets/StowellBirdID/')
DSLOC

PosixPath('/mnt/cube/Datasets/StowellBirdID')

In [7]:
CSVs = list(DSLOC.glob('csv/*.csv'))
len(CSVs), CSVs[:3]

(20,
 [PosixPath('/mnt/cube/Datasets/StowellBirdID/csv/chiffchaff-acrossyear-bg-trn.csv'),
  PosixPath('/mnt/cube/Datasets/StowellBirdID/csv/pipit-withinyear-fg-tst.csv'),
  PosixPath('/mnt/cube/Datasets/StowellBirdID/csv/chiffchaff-acrossyear-bg-tst.csv')])

In [8]:
csv_df = pd.DataFrame(
    [csv.stem.split("-") + [csv] for csv in CSVs],
    columns=["species", "withinacross", "fgbg", "traintest", "csvloc"],
)
csv_df[:3]

| | species | withinacross | fgbg | traintest | csvloc |
|---|---|---|---|---|---|
| 0 | chiffchaff | acrossyear | bg | trn | /mnt/cube/Datasets/StowellBirdID/csv/chiffchaf... |
| 1 | pipit | withinyear | fg | tst | /mnt/cube/Datasets/StowellBirdID/csv/pipit-wit... |
| 2 | chiffchaff | acrossyear | bg | tst | /mnt/cube/Datasets/StowellBirdID/csv/chiffchaf... |
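
The cell above relies on each CSV filename encoding four hyphen-separated fields. As a small sketch of that convention, the hypothetical helper `split_csv_name` below (not part of avgn) splits one filename stem and fails loudly if the field count is unexpected, which the list comprehension above would only report later as a column-count mismatch.

```python
from pathlib import Path

# Column names taken from the csv_df definition above.
FIELDS = ["species", "withinacross", "fgbg", "traintest"]


def split_csv_name(csv_path):
    """Split a CSV filename such as 'pipit-withinyear-fg-tst.csv' into named fields."""
    parts = Path(csv_path).stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {parts}")
    return dict(zip(FIELDS, parts))


example = split_csv_name(
    "/mnt/cube/Datasets/StowellBirdID/csv/pipit-withinyear-fg-tst.csv"
)
print(example)
```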


In [9]:
# parse every CSV in parallel, then concatenate the per-CSV results
with Parallel(n_jobs=-1, verbose=10) as parallel:
    wav_dfs = parallel(
        delayed(parse_csv)(csvrow, DSLOC)
        for _, csvrow in tqdm(csv_df.iterrows(), total=len(csv_df))
    )
wav_df = pd.concat(wav_dfs)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.





[Parallel(n_jobs=-1)]: Done   3 out of  20 | elapsed:    5.2s remaining:   29.6s
[Parallel(n_jobs=-1)]: Done   6 out of  20 | elapsed:    5.4s remaining:   12.7s
[Parallel(n_jobs=-1)]: Done   9 out of  20 | elapsed:    5.8s remaining:    7.0s
[Parallel(n_jobs=-1)]: Done  12 out of  20 | elapsed:    6.0s remaining:    4.0s
[Parallel(n_jobs=-1)]: Done  15 out of  20 | elapsed:    6.6s remaining:    2.2s
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:    9.1s remaining:    1.0s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   28.5s finished


In [10]:
print(len(wav_df))
display(wav_df[:3])

18110


| | species | year | fgbg | trntst | indv | cutted | groundx | wavnum | wavloc |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chiffchaff | acrossyear | bg | trn | F72726 | cutted | bgx | 0 | /mnt/cube/Datasets/StowellBirdID/wav/chiffchaf... |
| 1 | chiffchaff | acrossyear | bg | trn | F72726 | cutted | bgx | 1 | /mnt/cube/Datasets/StowellBirdID/wav/chiffchaf... |
| 2 | chiffchaff | acrossyear | bg | trn | F72726 | cutted | bgx | 2 | /mnt/cube/Datasets/StowellBirdID/wav/chiffchaf... |


In [11]:
np.sum(wav_df.trntst == 'trn')/len(wav_df)

0.7427388183324131

In [12]:
wav_df.year.unique()

array(['acrossyear', 'withinyear'], dtype=object)

In [13]:
wav_df.cutted.unique()

array(['cutted', 'pipit2017fg', 'littleowl2017bg', 'pipit2017bg',
       'littleowl2017fg', 'linhart2015marnosong'], dtype=object)

In [14]:
wav_df.species.unique()

array(['chiffchaff', 'pipit', 'littleowl'], dtype=object)

In [15]:
wav_df.fgbg.unique()

array(['bg', 'fg'], dtype=object)

### Find corresponding WAVs and noise

In [18]:
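# generate one JSON per foreground WAV, passing along all rows
# recorded from the same individual so noise files can be matched
with Parallel(n_jobs=-1, verbose=10) as parallel:
    parallel(
        delayed(generate_json)(
            row, DT_ID, noise_indv_df=wav_df[(wav_df.indv == row.indv)]
        )
        for _, row in tqdm(
            wav_df[wav_df.fgbg == "fg"].iterrows(),
            total=int(np.sum(wav_df.fgbg == "fg")),
        )
    )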

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed:    5.2s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    5.3s
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done  50 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done  65 tasks      | elapsed:    5.5s
[Parallel(n_jobs=-1)]: Done  80 tasks      | elapsed:    5.5s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.1991s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done  97 tasks      | elapsed:    5.6s
[Parallel(n_jobs=-1)]: Done 114 tasks      | elapsed:    5.8s
[Parallel(n_jobs=-1)]: Done 133 tasks      | elapsed:    5.9s
[Parallel(n_jobs=-1)]: Done 171 tasks      | elapsed:    6.0s
[Parallel(n_jobs=-1)]: Done 213 tasks      | elapsed:    6.2s
[Parallel(n_jobs=-1)]: Done 255 tasks      | elapsed:    6.3s
...
[Parallel(n_jobs=-1)]: Done 9101 out of 9148 | elapsed:   52.0s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done 9148 out of 9148 | elapsed:   52.0s finished
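
The filter `wav_df.indv == row.indv` in the cell above hands each foreground row every recording from the same individual. The sketch below mirrors that selection with plain Python dictionaries instead of a DataFrame. The toy rows, IDs, and paths are made up, and how `generate_json` chooses a noise file from those candidates is not shown in this notebook, so the sketch only lists the background candidates per foreground file.

```python
# Toy rows standing in for wav_df; individual IDs and paths are made up.
rows = [
    {"indv": "A", "fgbg": "fg", "wavloc": "fg/A_0.wav"},
    {"indv": "A", "fgbg": "bg", "wavloc": "bg/A_0.wav"},
    {"indv": "B", "fgbg": "fg", "wavloc": "fg/B_0.wav"},
]


def rows_for_individual(rows, indv):
    """Return every row (foreground and background) recorded from one individual."""
    return [r for r in rows if r["indv"] == indv]


noise_candidates = {}
for row in rows:
    if row["fgbg"] != "fg":
        continue  # only foreground files are iterated over, as in the cell above
    same_indv = rows_for_individual(rows, row["indv"])
    # Keep the background recordings as possible noise files for this vocalization.
    noise_candidates[row["wavloc"]] = [
        r["wavloc"] for r in same_indv if r["fgbg"] == "bg"
    ]

print(noise_candidates)
```

Individual `B` has no background recording in the toy data, so its candidate list is empty, which corresponds to the "where available" case.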
