## Getting audio clips of the required classes from the Google AudioSet data
https://research.google.com/audioset/dataset/index.html
* Since the full dataset is very large, a few classes were shortlisted to work on
    * Bark (Dog) vs Meow (Cat) (5GB of data, ~4.5k 10s audio clips)
    * (Motorboat, speedboat) vs Motorcycle vs (Race car, auto racing) vs Helicopter vs (Railroad car, train wagon) (close to 45GB of 10s audio data, ~33k clips for just these 5 classes). This experiment takes a long time given the data size
    * Change the class IDs in the code as appropriate
* To get the maximum amount of data for the above classes, all the available AudioSet splits (eval, balanced train and unbalanced train) are used
    * Refer to https://research.google.com/audioset//download.html#split for more details
    * The ID-to-class mapping is available at https://github.com/audioset/ontology/blob/master/ontology.json
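
As a rough illustration of the ID lookup, the sketch below builds a name-to-ID dictionary from entries shaped like those in `ontology.json` (a JSON list of objects with `id` and `name` fields). The helper `build_name_to_id` and the inline sample are hypothetical; with the real file you would load it with `json.load` instead.

```python
import json

def build_name_to_id(ontology):
    """Map each class display name to its AudioSet ID (hypothetical helper)."""
    return {entry["name"]: entry["id"] for entry in ontology}

# Tiny inline sample in the same shape as ontology.json entries.
# With the downloaded file, use instead:
#     with open("ontology.json") as f:
#         ontology = json.load(f)
sample_ontology = json.loads("""
[
  {"id": "/m/05tny_", "name": "Bark", "child_ids": []},
  {"id": "/m/07qrkrw", "name": "Meow", "child_ids": []}
]
""")

name_to_id = build_name_to_id(sample_ontology)
wanted = ["Bark", "Meow"]
c2code_example = {name: name_to_id[name] for name in wanted}
print(c2code_example)
```

Looking the IDs up by name avoids copy-paste mistakes when switching to a different set of classes.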

In [6]:
import os
import subprocess
import youtube_dl
import pandas as pd
import glob

## Merging the csv annotation files - Eval, Balanced and Unbalanced to get maximum data

In [3]:
# os.chdir('./data/google_audioset')

In [7]:
# Read every *_segments.csv annotation file and concatenate them into one DataFrame
frames = []
for cnt, i in enumerate(glob.glob('./*_segments.csv')):
    print(cnt, i)
    # The first two lines of each file are comments; the third line holds the column names
    x = pd.read_csv(i, # nrows=1000,
                    sep=',', skiprows=2, quotechar='"', engine='python', skipinitialspace=True)
    frames.append(x)
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df = pd.concat(frames, ignore_index=True)

0 ./eval_segments.csv
1 ./unbalanced_train_segments.csv
2 ./balanced_train_segments.csv


In [8]:
df.head()

Unnamed: 0,# YTID,start_seconds,end_seconds,positive_labels
0,--4gqARaEJE,0.0,10.0,"/m/068hy,/m/07q6cd_,/m/0bt9lr,/m/0jbk"
1,--BfvyPmVMo,20.0,30.0,/m/03l9g
2,--U7joUcTCo,0.0,10.0,/m/01b_21
3,--i-y1v8Hy8,0.0,9.0,"/m/04rlf,/m/09x0r,/t/dd00004,/t/dd00005"
4,-0BIyqJj9ZU,30.0,40.0,"/m/07rgt08,/m/07sq110,/t/dd00001"


In [9]:
df['type'] = df.positive_labels.map(lambda x: 'classification' if len(x.split(','))==1 else 'multi label')
print(df.type.value_counts())

multi label       1183535
classification     900785
Name: type, dtype: int64


## Filtering clips with required classes

In [22]:
# Use the corresponding IDs for the classes, as listed in the ontology JSON file mentioned at the start
# c2code_map = {'Bark':'/m/05tny_', 'Meow':'/m/07qrkrw'}
c2code_map = {'Boat':'/m/02rlv9', 'Motorcycle':'/m/04_sv', 'Racecar':'/m/0ltv', 'Helicopter':'/m/09ct_', 'Railroadcar':'/m/01g50p'}

# # bark
# df['Bark']= df.positive_labels.map(lambda x: 1 if (c2code_map['Bark'] in x.split(',')) else 0)
# print(df[df.Bark==1].type.value_counts())
# # Meow
# df['Meow']= df.positive_labels.map(lambda x: 1 if (c2code_map['Meow'] in x.split(',')) else 0)
# print(df[df.Meow==1].type.value_counts())

# Boat
df['Boat']= df.positive_labels.map(lambda x: 1 if (c2code_map['Boat'] in x.split(',')) else 0)
print('Boat',df[df.Boat==1].type.value_counts())
# Motorcycle
df['Motorcycle']= df.positive_labels.map(lambda x: 1 if (c2code_map['Motorcycle'] in x.split(',')) else 0)
print(df[df.Motorcycle==1].type.value_counts())
# Racecar
df['Racecar']= df.positive_labels.map(lambda x: 1 if (c2code_map['Racecar'] in x.split(',')) else 0)
print('Racecar',df[df.Racecar==1].type.value_counts())
# Helicopter
df['Helicopter']= df.positive_labels.map(lambda x: 1 if (c2code_map['Helicopter'] in x.split(',')) else 0)
print('Helicopter',df[df.Helicopter==1].type.value_counts())
# Railroadcar
df['Railroadcar']= df.positive_labels.map(lambda x: 1 if (c2code_map['Railroadcar'] in x.split(',')) else 0)
print('Railroadcar',df[df.Railroadcar==1].type.value_counts())

Boat multi label       7588
classification     490
Name: type, dtype: int64
multi label       7019
classification     242
Name: type, dtype: int64
Racecar multi label       6310
classification     264
Name: type, dtype: int64
Helicopter multi label       2964
classification     734
Name: type, dtype: int64
Railroadcar multi label       8342
classification      19
Name: type, dtype: int64


* Most of the audio clips have more than one label, which makes them hard to use for ordinary single-label classification
* Keep only those clips that contain exactly one of the shortlisted classes (e.g. either a dog bark or a cat meow, but not both)
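
The filter described above can be sketched on a toy DataFrame. The helper `keep_single_class_clips` and the toy rows are illustrative assumptions, not part of the notebook; the idea is the same one-hot flag per class followed by a row sum equal to 1.

```python
import pandas as pd

def keep_single_class_clips(df, code_map):
    """Flag each shortlisted class and keep rows containing exactly one of them."""
    out = df.copy()
    label_sets = out["positive_labels"].str.split(",")
    for name, code in code_map.items():
        # 1 if the class ID is among the clip's labels, else 0
        out[name] = label_sets.map(lambda labels, c=code: int(c in labels))
    return out[out[list(code_map)].sum(axis=1) == 1]

toy = pd.DataFrame({
    "positive_labels": [
        "/m/05tny_",             # bark only: kept
        "/m/05tny_,/m/07qrkrw",  # bark and meow: dropped
        "/m/07qrkrw,/m/09x0r",   # meow plus an unrelated label: kept
        "/m/04rlf",              # neither class: dropped
    ],
})
toy_map = {"Bark": "/m/05tny_", "Meow": "/m/07qrkrw"}
single = keep_single_class_clips(toy, toy_map)
print(single)
```

Note that a kept clip may still carry other labels outside the shortlist; only overlap between the shortlisted classes is excluded.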

In [31]:
# Number of clips that contain more than one of the shortlisted classes
# df[(df.Bark==1) & (df.Meow==1)]
print(((df.Boat + df.Motorcycle + df.Racecar + df.Helicopter + df.Railroadcar)>1).sum())

# Clips where only one of the shortlisted classes is present in the audio
# BMdf = df[(df.Bark+df.Meow)==1]
BMdf = df[(df.Boat + df.Motorcycle + df.Racecar + df.Helicopter + df.Railroadcar)==1]

clipsmeta = list(zip(BMdf['# YTID'].values, BMdf.start_seconds.values))

145


## Download the 10 sec audio snippets for the filtered classes from YouTube videos
* Using youtube_dl to download the audio and ffmpeg to extract the required 10s clip
* This part takes a lot of time because it involves downloading the entire video. I couldn't figure out a way to fetch only the audio for a predefined time period. Any suggestions here would be very helpful
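
One possible way to avoid the full download, offered as an untested sketch rather than a confirmed fix: resolve the direct audio stream URL without downloading, then let ffmpeg seek on that URL. It assumes that `extract_info(..., download=False)` returns an info dict whose top-level `url` key points at the selected format, and that the server supports seeking; neither is guaranteed for every video. The helper names `build_ffmpeg_segment_cmd` and `fetch_segment` are hypothetical.

```python
import subprocess

def build_ffmpeg_segment_cmd(stream_url, start, dur, out_path):
    """Build an ffmpeg command that seeks on the input before decoding."""
    # -ss/-t placed before -i apply to the input, -vn drops any video stream,
    # -y overwrites an existing output file
    return ["ffmpeg", "-y", "-ss", str(start), "-t", str(dur),
            "-i", stream_url, "-vn", out_path]

def fetch_segment(ytid, start, dur=10):
    """Resolve the stream URL without downloading, then cut the segment."""
    import youtube_dl  # imported lazily so the command builder works without it
    opts = {"format": "bestaudio/best", "quiet": True}
    with youtube_dl.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(
            "https://www.youtube.com/watch?v={}".format(ytid), download=False)
    # Assumption: info["url"] is the direct URL of the selected audio format
    cmd = build_ffmpeg_segment_cmd(info["url"], start, dur, "./{}.wav".format(ytid))
    return subprocess.call(cmd)

print(build_ffmpeg_segment_cmd("http://example.invalid/audio", 30.0, 10, "./clip.wav"))
```

Whether this actually transfers less data depends on the container format and on the server honouring range requests, so it is worth timing on a handful of clips first.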

In [35]:
os.chdir('./google_audioset_Vehicles')

In [39]:
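def download_audioset(metadata):
    """Download the audio of one video and trim it to the labelled 10s segment."""
    i, start = metadata
    dur = 10
    print(i, start, dur)
    # The FFmpegExtractAudio postprocessor converts the download to ./<id>.wav
    ydl_opts = {'format': 'bestaudio/best',
                'outtmpl': './{}'.format(i+'.mp4'),
                'postprocessors': [{'key': 'FFmpegExtractAudio','preferredcodec': 'wav','preferredquality': '192'}]}
    try:
        with youtube_dl.YoutubeDL(ydl_opts) as ydl:
            ydl.download(['http://www.youtube.com/watch?v={}'.format(i)])
        # Cut the segment into a temporary file, then replace the full-length audio with it
        command = ['ffmpeg', '-ss', str(start), '-t', str(dur),
                   '-i', './{}.wav'.format(i), './{}_seg.wav'.format(i)]
        print(' '.join(command))
        subprocess.call(command)
        os.remove('./{}.wav'.format(i))
        os.rename('./{}_seg.wav'.format(i),'./{}.wav'.format(i))
    except Exception as e:
        # Catch Exception rather than using a bare except, so Ctrl-C still stops the run
        print('Video not available {}: {}'.format(i, e))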

In [40]:
import multiprocessing as mp
mp.cpu_count()

with mp.Pool(8) as pool:
    pool.map(download_audioset, clipsmeta)

8

In [43]:
# some clips will be missing due to unavailability of video
print(len(os.listdir('./')), len(clipsmeta))

32659 33681


In [44]:
fnames = [i[:-4] for i in os.listdir('./')]

In [48]:
BMdf.columns

Index(['# YTID', 'start_seconds', 'end_seconds', 'positive_labels', 'type',
       'Boat', 'Motorcycle', 'Racecar', 'Helicopter', 'Railroadcar'],
      dtype='object')

In [49]:
BMdf.columns = ['YTID', 'start_seconds', 'end_seconds', 'positive_labels', 'type','Boat', 'Motorcycle', 'Racecar', 'Helicopter', 'Railroadcar']

In [55]:
# Saving the metadata to generate labels.csv file later
BMdf[BMdf['YTID'].isin(fnames)].to_csv('./Vehicle_clips_metadata', sep=' ', index=False)
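
A minimal sketch of how the labels file could later be derived from this metadata, assuming each row has exactly one of the class columns set to 1. The helper `make_labels` and the output column names `fname` and `label` are hypothetical choices, not something fixed by the notebook.

```python
import pandas as pd

CLASS_COLS = ["Boat", "Motorcycle", "Racecar", "Helicopter", "Railroadcar"]

def make_labels(meta, class_cols=CLASS_COLS):
    """Turn the one-hot class columns into a single label per clip."""
    # idxmax(axis=1) returns the name of the column holding the 1 in each row
    labels = meta[class_cols].idxmax(axis=1)
    return pd.DataFrame({"fname": meta["YTID"] + ".wav", "label": labels})

# Toy rows standing in for the saved metadata; the real file would be read with
#     meta = pd.read_csv('./Vehicle_clips_metadata', sep=' ')
toy_meta = pd.DataFrame({
    "YTID": ["aaa", "bbb"],
    "Boat": [1, 0], "Motorcycle": [0, 0], "Racecar": [0, 0],
    "Helicopter": [0, 1], "Railroadcar": [0, 0],
})
labels = make_labels(toy_meta)
print(labels)
# labels.to_csv('./labels.csv', index=False)
```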