## Getting audio clips of required classes from Google AudioSet data
https://research.google.com/audioset/dataset/index.html
* The full dataset is very large, so only a few classes were shortlisted:
    * Natural sounds: Fire vs. Wind/Water/Storm.
* To get as much data as possible for these classes, all three AudioSet splits (eval, balanced train and unbalanced train) are used.
    * See https://research.google.com/audioset//download.html#split for more details.
    * The ID-to-class mapping is available at https://github.com/audioset/ontology/blob/master/ontology.json
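
The ID-to-class lookup mentioned above can be sketched as follows. This is only an illustration: it assumes each entry in `ontology.json` is an object with `id` and `name` fields, and it uses a four-entry inline excerpt instead of the real file. `build_name_to_id` and `str2id_sketch` are hypothetical names, not part of the notebook.

```python
import json

# Tiny inline excerpt shaped like ontology.json (assumed: a list of objects
# with "id" and "name" fields); the real file has many more entries.
ontology_excerpt = json.loads("""
[
  {"id": "/m/02_41", "name": "Fire"},
  {"id": "/m/03m9d0z", "name": "Wind"},
  {"id": "/m/0838f", "name": "Water"},
  {"id": "/m/0jb2l", "name": "Thunderstorm"}
]
""")


def build_name_to_id(entries):
    """Map each class name to its ontology ID."""
    return {entry["name"]: entry["id"] for entry in entries}


name_to_id = build_name_to_id(ontology_excerpt)

# Pick out only the classes of interest.
wanted = ["Fire", "Wind", "Water", "Thunderstorm"]
str2id_sketch = {name: name_to_id[name] for name in wanted}
print(str2id_sketch)
```

With the real file, `ontology_excerpt` would be replaced by `json.load(open(...))` on a local copy of `ontology.json`.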

In [6]:
import os
import subprocess
import youtube_dl
import pandas as pd
import glob

In [13]:
!pwd

/home/sramirez/git/FeuerFreiKiller/data/external


In [18]:
os.chdir('/home/sramirez/git/FeuerFreiKiller/notebooks/')

In [21]:
!wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/unbalanced_train_segments.csv -O '../data/external/unbalanced_data.csv'

--2019-08-12 13:53:31--  http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/unbalanced_train_segments.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.168.176, 2a00:1450:4003:809::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.168.176|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101468408 (97M) [application/octet-stream]
Saving to: ‘../data/external/unbalanced_data.csv’


2019-08-12 13:55:07 (1,01 MB/s) - ‘../data/external/unbalanced_data.csv’ saved [101468408/101468408]



In [19]:
!wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/balanced_train_segments.csv -O '../data/external/balanced_train_data.csv'

--2019-08-12 13:53:14--  http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/balanced_train_segments.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.168.176, 2a00:1450:4003:809::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.168.176|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1211931 (1,2M) [application/octet-stream]
Saving to: ‘../data/external/balanced_train_data.csv’


2019-08-12 13:53:15 (2,85 MB/s) - ‘../data/external/balanced_train_data.csv’ saved [1211931/1211931]



In [20]:
!wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/eval_segments.csv -O '../data/external/eval_data.csv'

--2019-08-12 13:53:26--  http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/eval_segments.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.168.176, 2a00:1450:4003:809::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.168.176|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1143389 (1,1M) [application/octet-stream]
Saving to: ‘../data/external/eval_data.csv’


2019-08-12 13:53:27 (3,74 MB/s) - ‘../data/external/eval_data.csv’ saved [1143389/1143389]



## Merging the CSV annotation files (eval, balanced and unbalanced) to get maximum data

In [22]:
#os.chdir('../data/external/')

In [31]:
path = '../data/external/'
all_files = glob.glob(os.path.join(path, "*.csv"))  # avoids the doubled slash from path + "/*.csv"
li = []

for filename in all_files:
    print(filename)
    df = pd.read_csv(filename, skiprows=2, quotechar='"', engine='python', skipinitialspace=True)
    li.append(df)

df = pd.concat(li, axis=0, ignore_index=True)

../data/external/balanced_train_data.csv
../data/external/eval_data.csv
../data/external/unbalanced_data.csv


In [36]:
df.shape,  2042985 + 22176 + 20383

((2084320, 4), 2085544)

In [37]:
df.head()

Unnamed: 0,# YTID,start_seconds,end_seconds,positive_labels
0,--PJHxphWEs,30.0,40.0,"/m/09x0r,/t/dd00088"
1,--ZhevVpy1s,50.0,60.0,/m/012xff
2,--aE2O5G5WE,0.0,10.0,"/m/03fwl,/m/04rlf,/m/09x0r"
3,--aO5cdqSAg,30.0,40.0,"/t/dd00003,/t/dd00005"
4,--aaILOrkII,200.0,210.0,"/m/032s66,/m/073cg4"


In [40]:
df.duplicated().sum()

0

In [34]:
df.shape[0] == 2042985 + 22176 + 20383

False
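
The merged frame has 1,224 fewer rows than the sum used in the check above. One way to see where the difference comes from is to count the rows of each file separately and compare each count with the figure expected for that split. The sketch below reads a file the same way as the merge cell; `count_rows` is a hypothetical helper and `sample_text` is a made-up two-row sample in the same layout (two comment lines, a header line, then data).

```python
import io

import pandas as pd


def count_rows(csv_source):
    """Count data rows, parsing the file exactly like the merge cell does."""
    frame = pd.read_csv(csv_source, skiprows=2, quotechar='"',
                        engine='python', skipinitialspace=True)
    return len(frame)


# Illustrative sample only: two comment lines, the header line, two data rows.
sample_text = (
    '# first comment line\n'
    '# second comment line\n'
    '# YTID, start_seconds, end_seconds, positive_labels\n'
    '--PJHxphWEs, 30.000, 40.000, "/m/09x0r,/t/dd00088"\n'
    '--ZhevVpy1s, 50.000, 60.000, "/m/012xff"\n'
)
print(count_rows(io.StringIO(sample_text)))

# On the real data (assuming all_files from the merge cell):
# for filename in all_files:
#     print(filename, count_rows(filename))
```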

In [41]:
df['label'] = df.positive_labels.map(lambda x: 'single' if len(x.split(',')) == 1 else 'multi')
print(df.label.value_counts())

multi     1183535
single     900785
Name: label, dtype: int64


## Filtering clips with required classes

In [109]:
# Use the IDs of the selected classes, as listed in the ontology JSON file mentioned at the start
str2id = {'Fire': '/m/02_41', 'Wind': '/m/03m9d0z', 'Water': '/m/0838f', 'Thunderstorm': '/m/0jb2l'}
id2str = {v: k for k, v in str2id.items()}

for k in str2id.keys():
    df[k] = df.positive_labels.map(lambda x: 1 if (str2id[k] in x.split(',')) else 0)

In [110]:
for k in str2id.keys():
    print('Category: {}, # of elements: {}'.format(k, df[k].sum()))
    

Category: Fire, # of elements: 1445
Category: Wind, # of elements: 6805
Category: Water, # of elements: 8994
Category: Thunderstorm, # of elements: 1262
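
As an alternative to the per-class loop, the same 0/1 indicator columns can be built in one pass with `Series.str.get_dummies`. This is a minimal sketch on a three-row made-up sample, not the notebook's method; on the full table it creates a column for every label ID before the selection, so memory use may be a concern.

```python
import pandas as pd

str2id = {'Fire': '/m/02_41', 'Wind': '/m/03m9d0z',
          'Water': '/m/0838f', 'Thunderstorm': '/m/0jb2l'}

# Made-up sample rows in the same format as positive_labels.
sample = pd.DataFrame({'positive_labels': ['/m/02_41,/m/0838f',
                                           '/m/03m9d0z',
                                           '/m/09x0r']})

# One 0/1 column per label ID that occurs anywhere in the column.
dummies = sample['positive_labels'].str.get_dummies(sep=',')

# Keep only the selected IDs; reindex adds an all-zero column for any
# selected ID that never occurs in the data.
indicators = dummies.reindex(columns=list(str2id.values()), fill_value=0)
indicators.columns = list(str2id.keys())
print(indicators)
```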


In [111]:
# Keep only the rows labelled with at least one of the selected classes (target class: Fire)
fw_df = df[df[list(str2id.keys())].sum(axis=1) > 0]
fw_df.shape

(18082, 10)

In [114]:
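# Translate the selected IDs to class names after filtering.
# Note: attribute-style assignment attaches the Series to fw_df as a plain attribute,
# not as a DataFrame column, so it does not show up in the tables or saved files below.
fw_df.translated_labels = fw_df.positive_labels.map(lambda x: ','.join([id2str[y] for y in x.split(',') if y in id2str.keys()]))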

  


In [115]:
fw_df.translated_labels.value_counts()

Water                 8978
Wind                  6388
Thunderstorm          1233
Fire                  1059
Fire,Wind              381
Wind,Thunderstorm       27
Wind,Water               9
Fire,Water               5
Water,Thunderstorm       2
Name: positive_labels, dtype: int64

In [124]:
# Let's check an example for fire-water tuple

fw_df[fw_df.translated_labels == 'Fire,Water']


Unnamed: 0,# YTID,start_seconds,end_seconds,positive_labels,label,Fire,Wind,Water,Thunderstorm,Thunder
22956,1009ux1xbkg,0.0,10.0,"/m/02_41,/m/0838f",multi,1,0,1,0,0
25249,7Zdx0YrzHVk,20.0,30.0,"/m/02_41,/m/0838f",multi,1,0,1,0,0
1226923,ToKqR2NHqwQ,200.0,210.0,"/m/02_41,/m/0838f,/m/09x0r",multi,1,0,1,0,0
1237531,UB7upK3ZBsA,30.0,40.0,"/m/02_41,/m/07p9k1k,/m/0838f",multi,1,0,1,0,0
1894207,s6dbv2C2N8M,30.0,40.0,"/m/02_41,/m/0838f,/m/09x0r",multi,1,0,1,0,0


In [118]:
fw_df.head()

Unnamed: 0,# YTID,start_seconds,end_seconds,positive_labels,label,Fire,Wind,Water,Thunderstorm,Thunder
8,-0DdlOuIFUI,50.0,60.0,"/m/0130jx,/m/02jz0l,/m/0838f",multi,0,0,1,0,0
38,-5GhUbDLYkQ,16.0,26.0,"/m/0130jx,/m/02jz0l,/m/0838f",multi,0,0,1,0,0
71,-8_HpHg6nCw,30.0,40.0,"/m/07ptzwd,/m/0838f",multi,0,0,1,0,0
76,-99daJhXYJY,30.0,40.0,"/m/019jd,/m/02rlv9,/m/03m9d0z,/t/dd00092",multi,0,1,0,0,0
156,-JKLmqDk9p8,490.0,500.0,"/m/06mb1,/m/07r10fb,/m/0jb2l,/m/0ngt1,/t/dd00038",multi,0,0,0,1,1


In [119]:
fw_df.to_csv('../data/interim/firewind_dataset_links.csv')
fw_df.to_pickle('../data/interim/firewind_dataset_links.pickle')

## Download the 10-second audio snippets for the filtered classes from YouTube videos
* ffmpeg is used to get the audio and extract the required 10 s clip.
* This step takes a long time because it downloads the entire video. I couldn't figure out a way to fetch only the audio for a predefined time window; any suggestions here would be very helpful.
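
One possible way to avoid downloading the whole file is to let ffmpeg seek directly in the audio stream: placing `-ss` and `-t` before `-i` makes them input options, so ffmpeg seeks first and reads only the requested duration. The sketch below is untested against YouTube and rests on assumptions: `build_ffmpeg_clip_cmd` and `clip_from_youtube` are hypothetical helpers, and the direct stream address is assumed to be available as `info['url']` from `extract_info(..., download=False)`, which may not hold for every format.

```python
import subprocess


def build_ffmpeg_clip_cmd(stream_url, start, duration, out_path):
    """Build an ffmpeg command that reads only the requested segment."""
    return [
        'ffmpeg',
        '-y',                 # overwrite the output file if it exists
        '-ss', str(start),    # input option: seek before decoding
        '-t', str(duration),  # input option: limit how much is read
        '-i', stream_url,
        '-vn',                # drop any video stream
        out_path,
    ]


def clip_from_youtube(ytid, start, duration=10, out_dir='.'):
    """Hypothetical helper: resolve the stream URL, then cut the clip."""
    import youtube_dl  # imported lazily; only needed for real downloads
    opts = {'format': 'bestaudio/best', 'quiet': True}
    with youtube_dl.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(
            'https://www.youtube.com/watch?v={}'.format(ytid), download=False)
    # Assumption: the selected format exposes a direct URL under 'url'.
    cmd = build_ffmpeg_clip_cmd(info['url'], start, duration,
                                '{}/{}.wav'.format(out_dir, ytid))
    return subprocess.call(cmd)


# Show the command that would be run for a placeholder input.
print(build_ffmpeg_clip_cmd('INPUT_URL', 30.0, 10, 'clip.wav'))
```

`clip_from_youtube` takes the same `(ytid, start)` information as the entries of `clipsmeta`, so it could be tried on a handful of clips before replacing the full download.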

In [129]:
meta_df = pd.read_pickle('../data/interim/firewind_dataset_links.pickle')
clipsmeta = list(zip(meta_df['# YTID'].values, meta_df.start_seconds.values))

In [162]:
processed_path = '../data/processed/'
os.makedirs(processed_path, exist_ok=True)  # create the folder first; os.chdir fails if it is missing
os.chdir(processed_path)

In [169]:
def download_audioset(metadata):
    """Download the audio of one YouTube video and convert it to WAV."""
    i, start = metadata  # i: YouTube ID, start: clip start in seconds
    dur = 10  # clip length in seconds
    print(i, start, dur)
    ydl_opts = {'format': 'bestaudio/best',
                'outtmpl': './{}'.format(i + '.mp4'),
                'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'wav', 'preferredquality': '192'}],
                'prefer_ffmpeg': True,
                'keepvideo': True,
                # The full file is still downloaded with these settings (see the note above),
                # so this entry does not appear to restrict the download to the clip.
                'info_dict': {'start_time': start,
                              'end_time': start + dur}}
    try:
        with youtube_dl.YoutubeDL(ydl_opts) as ydl:
            ydl.download(['http://www.youtube.com/watch?v={}'.format(i)])
            #command = "ffmpeg -k -ss {} -t {} -i ./{}.wav ./{}.wav".format(start, dur, i,i+'_seg')
            #print(command)
            #subprocess.call(command, shell=True);
            #os.remove('./{}.wav'.format(i))
            #os.rename('./{}_seg.wav'.format(i),'./{}.wav'.format(i))
    except Exception:
        # Catch download errors only, so that KeyboardInterrupt is not swallowed
        print('Video not available{}'.format(i))

In [170]:
import multiprocessing as mp
mp.cpu_count()

with mp.Pool(8) as pool:
    pool.map(download_audioset, clipsmeta)

-0DdlOuIFUI 50.0 10
0uIe4dKUC1s 50.0 10
0NZLG7HMiLg 590.0 10
-2FVPRdbLdA 10.0 10
2u9Vh6Q-gKQ 0.0 10
4lke5p8dxI0 50.0 10
6Yv2uAr_54w 0.0 10
8QmwXbKIwUA 0.0 10
[youtube] -0DdlOuIFUI: Downloading webpage
[youtube] -2FVPRdbLdA: Downloading webpage
[youtube] 2u9Vh6Q-gKQ: Downloading webpage
[youtube] 0NZLG7HMiLg: Downloading webpage
[youtube] 0uIe4dKUC1s: Downloading webpage
[youtube] 8QmwXbKIwUA: Downloading webpage
[youtube] 4lke5p8dxI0: Downloading webpage
[youtube] 6Yv2uAr_54w: Downloading webpage
[youtube] 4lke5p8dxI0: Downloading video info webpage
[youtube] 0NZLG7HMiLg: Downloading video info webpage
[youtube] -2FVPRdbLdA: Downloading video info webpage
[youtube] 8QmwXbKIwUA: Downloading video info webpage
[youtube] 6Yv2uAr_54w: Downloading video info webpage
[youtube] 0uIe4dKUC1s: Downloading video info webpage
[youtube] 2u9Vh6Q-gKQ: Downloading video info webpage
[youtube] -0DdlOuIFUI: Downloading video info webpage


ERROR: This video contains content from AdonnanteTv. It is not available in your country.


Video not available4lke5p8dxI0
4ltnOsbLIJI 40.0 10
[youtube] 0uIe4dKUC1s: Downloading MPD manifest
[download] Destination: ./0NZLG7HMiLg.mp4
[download]   0.3% of 11.28MiB at  1.32MiB/s ETA 00:08[youtube] 4ltnOsbLIJI: Downloading webpage
[download] Destination: ./-2FVPRdbLdA.mp4
[download]   2.2% of 11.28MiB at  2.93MiB/s ETA 00:03[download] Destination: ./8QmwXbKIwUA.mp4
[download]   8.9% of 11.28MiB at 824.47KiB/s ETA 00:12[download] Destination: ./6Yv2uAr_54w.mp4
[download] Destination: ./2u9Vh6Q-gKQ.mp4
[download]  12.6% of 500.88KiB at 296.61KiB/s ETA 00:01[download] Destination: ./-0DdlOuIFUI.mp4
[download]   0.0% of 2.67MiB at 1012.00B/s ETA 46:07[youtube] 4ltnOsbLIJI: Downloading video info webpage
[download] 100% of 407.62KiB in 00:02
[download] 100% of 710.39KiB in 00:02
[download] 100% of 124.97KiB in 00:02
[download]  18.5% of 2.67MiB at 446.36KiB/s ETA 00:04[ffmpeg] Destination: ./6Yv2uAr_54w.wav
[ffmpeg] Correcting container in "./8QmwXbKIwUA.mp4"
6ZFCXGAbz3s 340.0 10
[ffm

ERROR: This video is unavailable.


Video not available4ltnOsbLIJI
4lwK3Ms-_DA 30.0 10
[download] 100% of 500.88KiB in 00:03
[download]  15.0% of 11.28MiB at 503.46KiB/s ETA 00:19[ffmpeg] Destination: ./2u9Vh6Q-gKQ.wav
[youtube] 4lwK3Ms-_DA: Downloading webpage
2uH6sLmJgA4 30.0 10
[youtube] 2uH6sLmJgA4: Downloading webpage
[download]  19.2% of 11.28MiB at 416.28KiB/s ETA 00:22[youtube] 6ZFCXGAbz3s: Downloading video info webpage
[youtube] 8RM4ExvUKoM: Downloading video info webpage
[download] 100% of 2.67MiB in 00:05
[download]  21.4% of 11.28MiB at 451.63KiB/s ETA 00:20[youtube] 2uH6sLmJgA4: Downloading video info webpage
[ffmpeg] Correcting container in "./-0DdlOuIFUI.mp4"
[ffmpeg] Destination: ./-0DdlOuIFUI.wav
-5GhUbDLYkQ 16.0 10
[youtube] -2NNI-DKgkI: Downloading video info webpage
[download]  30.7% of 11.28MiB at 547.00KiB/s ETA 00:14[youtube] -5GhUbDLYkQ: Downloading webpage
[download] Destination: ./-2NNI-DKgkI.mp4
[download]   7.3% of 3.41MiB at  4.31MiB/s ETA 00:00[download] Destination: ./2uH6sLmJgA4.mp4
[down

KeyboardInterrupt: 

In [None]:
# some clips will be missing due to unavailability of video
print(len(os.listdir('./')), len(clipsmeta))
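
Because some videos are unavailable and a long run can be interrupted, it may help to list which clips are still missing so that only those are retried. The sketch below is an illustration under assumptions: `missing_clips` is a hypothetical helper, finished clips are assumed to be stored as `<YTID>.wav`, and the demo uses a temporary folder with one dummy file instead of real downloads.

```python
import os
import tempfile


def missing_clips(clipsmeta, directory, extension='.wav'):
    """Return the (ytid, start) pairs that have no matching file yet."""
    done = {os.path.splitext(name)[0]
            for name in os.listdir(directory)
            if name.endswith(extension)}
    return [(ytid, start) for ytid, start in clipsmeta if ytid not in done]


# Demo: one of two clips already "downloaded" as an empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, '-0DdlOuIFUI.wav'), 'w').close()
    demo_meta = [('-0DdlOuIFUI', 50.0), ('4lke5p8dxI0', 50.0)]
    remaining = missing_clips(demo_meta, tmp)

print(remaining)
```

In the notebook, the result of `missing_clips(clipsmeta, './')` could be passed to `pool.map` in place of `clipsmeta` when resuming.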

In [44]:
fnames = [i[:-4] for i in os.listdir('./')]

In [48]:
BMdf.columns

Index(['# YTID', 'start_seconds', 'end_seconds', 'positive_labels', 'type',
       'Boat', 'Motorcycle', 'Racecar', 'Helicopter', 'Railroadcar'],
      dtype='object')

In [49]:
BMdf.columns = ['YTID', 'start_seconds', 'end_seconds', 'positive_labels', 'type','Boat', 'Motorcycle', 'Racecar', 'Helicopter', 'Railroadcar']

In [55]:
# Saving the metadata to generate labels.csv file later
BMdf[BMdf['YTID'].isin(fnames)].to_csv('./Vehicle_clips_metadata', sep=' ', index=False)