# Creating an Audioset subset

First, we need to download the precomputed [Audioset](https://research.google.com/audioset/download.html) features. The archive is about 2.4 GiB (2.59 GB), so the download may take some time.

### NOTE: The Audioset data is stored in files whose names differ only by case (for example `ph.tfrecord` and `Ph.tfrecord`). If you extract the archive on a case-insensitive filesystem (such as the macOS default), about 75% of the dataset will be overwritten. Run this only on a case-sensitive filesystem, such as a Linux machine.
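
Before extracting, it can help to confirm that the target directory is case-sensitive. The following is a minimal sketch, not part of the original notebook; the helper names `filesystem_is_case_sensitive` and `case_collisions` are hypothetical. The first probes a directory by creating a temporary file, the second lists names that would collide once case is ignored.

```python
import os
import tempfile
from collections import defaultdict


def filesystem_is_case_sensitive(directory="."):
    """Return True if `directory` is on a case-sensitive filesystem."""
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        with open(os.path.join(tmp, "ph.tfrecord"), "w"):
            pass
        # On a case-insensitive filesystem, the differently-cased name
        # resolves to the file we just created.
        return not os.path.exists(os.path.join(tmp, "Ph.tfrecord"))


def case_collisions(names):
    """Group names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    return [group for group in groups.values() if len(group) > 1]


# Two of these shard names appear in the eval listing of the archive.
shards = ["ph.tfrecord", "Ph.tfrecord", "W8.tfrecord"]
collisions = case_collisions(shards)
print(collisions)
```

If `filesystem_is_case_sensitive()` returns `False`, extract the archive on a Linux machine or in Colab instead.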

In [1]:
!wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/features/features.tar.gz

--2019-07-20 14:34:47--  http://storage.googleapis.com/us_audioset/youtube_corpus/v1/features/features.tar.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.20.128, 2607:f8b0:400e:c09::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.20.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2588881044 (2.4G) [application/octet-stream]
Saving to: ‘features.tar.gz’


2019-07-20 14:35:33 (55.2 MB/s) - ‘features.tar.gz’ saved [2588881044/2588881044]



In [2]:
!wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/unbalanced_train_segments.csv

--2019-07-20 14:35:42--  http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/unbalanced_train_segments.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.142.128, 2607:f8b0:400e:c08::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.142.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101468408 (97M) [application/octet-stream]
Saving to: ‘unbalanced_train_segments.csv’


2019-07-20 14:35:42 (210 MB/s) - ‘unbalanced_train_segments.csv’ saved [101468408/101468408]



In [3]:
!tar xvzf features.tar.gz

audioset_v1_embeddings/
audioset_v1_embeddings/eval/
audioset_v1_embeddings/eval/W8.tfrecord
audioset_v1_embeddings/eval/L3.tfrecord
audioset_v1_embeddings/eval/KO.tfrecord
audioset_v1_embeddings/eval/xt.tfrecord
audioset_v1_embeddings/eval/Ul.tfrecord
audioset_v1_embeddings/eval/2T.tfrecord
audioset_v1_embeddings/eval/bC.tfrecord
audioset_v1_embeddings/eval/sG.tfrecord
audioset_v1_embeddings/eval/JB.tfrecord
audioset_v1_embeddings/eval/oU.tfrecord
audioset_v1_embeddings/eval/D1.tfrecord
audioset_v1_embeddings/eval/ph.tfrecord
audioset_v1_embeddings/eval/qA.tfrecord
audioset_v1_embeddings/eval/1v.tfrecord
audioset_v1_embeddings/eval/Vu.tfrecord
audioset_v1_embeddings/eval/70.tfrecord
audioset_v1_embeddings/eval/mV.tfrecord
audioset_v1_embeddings/eval/aK.tfrecord
audioset_v1_embeddings/eval/Ph.tfrecord
audioset_v1_embeddings/eval/is.tfrecord
audioset_v1_embeddings/eval/ka.tfrecord
audioset_v1_embeddings/eval/Jk.tfrecord
audioset_v1_embeddings/eval/e0.tfrecord
audioset_v1_embeddings/eval

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import pandas as pd
import glob


To run this on a Linux machine, I used this notebook in [Google Colaboratory](https://colab.research.google.com/).
The following cells upload the necessary files to the Colab instance.
If you are running this locally, you can skip this section.

In [6]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  

In [0]:
for k,v in uploaded.items():
  with open(k,'wb') as f:
    f.write(v)

In [8]:
uploaded.keys()

dict_keys([])

The laughter and non-laughter labels are loaded here from local .csv files (`laugh_labels.csv` and `human_non_laugh_labels.csv`).
If you want to create a subset for a different category of labels, create new files listing the labels you want to select as your positive and negative classes. You can find a list of all the Audioset labels [here](class_labels_indices.csv).
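
One way to build such a label file is to filter the full label list by display name. The sketch below makes several assumptions: that `class_labels_indices.csv` has a header row with the columns `index`, `mid` and `display_name`, that the three sample rows are only an illustrative stand-in for the real file, and that the output should be headerless so it matches the `names=['num','label','description']` argument used when the label files are read. The helper names are hypothetical.

```python
import csv
import io

# Illustrative stand-in for class_labels_indices.csv (column names assumed).
SAMPLE = '''index,mid,display_name
0,/m/09x0r,"Speech"
16,/m/01j3sz,"Laughter"
17,/t/dd00001,"Baby laughter"
'''


def select_labels(csv_file, keyword):
    """Return (index, mid, display_name) rows whose name contains `keyword`."""
    keyword = keyword.lower()
    return [
        (row["index"], row["mid"], row["display_name"])
        for row in csv.DictReader(csv_file)
        if keyword in row["display_name"].lower()
    ]


def write_label_file(rows, out_file):
    """Write rows without a header: num, label, description."""
    csv.writer(out_file, lineterminator="\n").writerows(rows)


selected = select_labels(io.StringIO(SAMPLE), "laugh")
buffer = io.StringIO()
write_label_file(selected, buffer)
print(buffer.getvalue())
```

With the real file, pass `open('class_labels_indices.csv', newline='')` instead of the `io.StringIO` sample and write to a file such as `laugh_labels.csv`.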

In [5]:
labels = pd.read_csv('unbalanced_train_segments.csv',header=2, quotechar=r'"',skipinitialspace=True)
laugh_labels = pd.read_csv('laugh_labels.csv',names=['num','label','description'])
not_laugh_labels = pd.read_csv('human_non_laugh_labels.csv',names=['num','label','description'])
l_str = '|'.join(laugh_labels['label'].values)
print(l_str)
labels.head()

/m/01j3sz|/t/dd00001|/m/07r660_|/m/07s04w4|/m/07sq110|/m/07rgt08


Unnamed: 0,# YTID,start_seconds,end_seconds,positive_labels
0,---1_cCGK4M,0.0,10.0,"/m/01g50p,/m/0284vy3,/m/06d_3,/m/07jdr,/m/07rwm0c"
1,---2_BBVHAA,30.0,40.0,/m/09x0r
2,---B_v8ZoBY,30.0,40.0,/m/04rlf
3,---EDNidJUA,30.0,40.0,"/m/02qldy,/m/02zsn,/m/05zppz,/m/09x0r"
4,---N4cFAE1A,21.0,31.0,"/m/04rlf,/m/09x0r"


In [6]:
n_str = '|'.join(not_laugh_labels['label'].values)
labels['not_laughter'] = (labels['positive_labels'].str.contains(n_str) & ~labels['positive_labels'].str.contains(l_str))
labels.head()
print(labels.not_laughter.sum() / len(labels))  # about 49% of the segments have a non-laughter label and no laughter label

0.49183779518843523
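
The cells above join the label IDs with `|` because `str.contains` treats its argument as a regular expression by default, so the joined string means "any of these IDs". As a plain substring test it could, in principle, also match a longer ID that merely starts with one of the labels. A stricter variant, sketched here with hypothetical helper names, escapes each ID and only accepts whole comma-separated entries.

```python
import re


def build_label_pattern(label_ids):
    """Compile a regex that matches any ID as a whole comma-separated entry."""
    alternation = "|".join(re.escape(label) for label in label_ids)
    # Non-capturing groups: the ID must be bounded by a comma or the string edge.
    return re.compile(r"(?:^|,)(?:" + alternation + r")(?:,|$)")


def has_any_label(positive_labels, pattern):
    """Return True if the comma-separated label string contains a wanted ID."""
    return pattern.search(positive_labels) is not None


pattern = build_label_pattern(["/m/01j3sz", "/t/dd00001"])
print(has_any_label("/m/04rlf,/m/01j3sz", pattern))
print(has_any_label("/m/01j3sz9,/m/09x0r", pattern))
```

The string form, `pattern.pattern`, can be passed to `labels['positive_labels'].str.contains(...)` in place of `l_str` or `n_str`.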


## Eval set

In [13]:
%%time
labels = pd.read_csv('eval_segments.csv',header=2, quotechar=r'"',skipinitialspace=True)
labels['laughter'] = labels['positive_labels'].str.contains(l_str)
labels['not_laughter'] = (labels['positive_labels'].str.contains(n_str) & ~labels['positive_labels'].str.contains(l_str))


positive = labels[labels['laughter']==True]
negative = labels[labels['not_laughter']==True].sample(positive.shape[0])
subset = pd.concat([positive, negative])  # DataFrame.append was removed in pandas 2.0
subset.to_csv('eval_laugh_speech_training_subset.csv')
print(subset.shape[0])


tfrecord_files = glob.glob('audioset_v1_embeddings/eval/*')  # separate name, so google.colab's `files` is not shadowed
subset_ids = set(subset['# YTID'].values)  # a set keeps the membership test below fast

# Loop over the TFRecord embedding files of the evaluation split and copy every record
# whose video_id is in the subset into eval_laugh_speech_subset.tfrecord.
i = 0
writer = tf.python_io.TFRecordWriter('eval_laugh_speech_subset.tfrecord')
for tfrecord in tfrecord_files:
    for example in tf.python_io.tf_record_iterator(tfrecord):
        tf_example = tf.train.Example.FromString(example)
        vid_id = tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding='UTF-8')
        if vid_id in subset_ids:
            writer.write(example)  # write the original serialized bytes unchanged
            i += 1
print(i)

writer.close()

W0720 02:44:22.222311 140260003391360 deprecation.py:323] From <timed exec>:21: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


586
586
CPU times: user 28 s, sys: 984 ms, total: 28.9 s
Wall time: 29 s
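
The deprecation warning above recommends `tf.data.TFRecordDataset`. The following is a minimal sketch of the same filtering step under the assumption of TensorFlow 2.x with eager execution; it is not the code that produced the output above. The function name `filter_tfrecords` is hypothetical, and the parsing mirrors the notebook by reading `video_id` through `tf.train.Example`.

```python
import tensorflow as tf


def filter_tfrecords(input_paths, output_path, wanted_ids):
    """Copy records whose video_id is in `wanted_ids`; return the number kept."""
    wanted = set(wanted_ids)
    kept = 0
    with tf.io.TFRecordWriter(output_path) as writer:
        # In eager mode the dataset yields one scalar string tensor per record.
        for raw in tf.data.TFRecordDataset(input_paths):
            serialized = raw.numpy()
            example = tf.train.Example.FromString(serialized)
            vid_id = (
                example.features.feature["video_id"]
                .bytes_list.value[0]
                .decode("utf-8")
            )
            if vid_id in wanted:
                writer.write(serialized)  # keep the original bytes unchanged
                kept += 1
    return kept
```

For the eval split this could be called as `filter_tfrecords(glob.glob('audioset_v1_embeddings/eval/*'), 'eval_laugh_speech_subset.tfrecord', subset['# YTID'])`.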


In [0]:
#investigating the dataframes produced above
positive.head()
#positive.shape
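
The subset is balanced by drawing as many negative rows as there are positive rows. Because `sample` is called without a seed, each run picks different negatives, so the CSV and the TFRecord file change between runs. A reproducible variant might look like the sketch below; `balanced_subset` and the tiny `demo` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def balanced_subset(labels, seed=0):
    """Return all positive rows plus an equal-sized, seeded sample of negatives."""
    positive = labels[labels["laughter"]]
    candidates = labels[labels["not_laughter"]]
    # sample() raises if asked for more rows than exist, so cap the request;
    # with too few candidates the result has more positives than negatives.
    n = min(len(positive), len(candidates))
    negative = candidates.sample(n=n, random_state=seed)
    return pd.concat([positive, negative])


demo = pd.DataFrame({
    "# YTID": ["a", "b", "c", "d", "e"],
    "laughter": [True, False, False, True, False],
    "not_laughter": [False, True, True, False, True],
})
subset = balanced_subset(demo, seed=0)
print(len(subset))
```

Calling `balanced_subset` again with the same `seed` returns the same rows, which keeps the saved subset CSV and the filtered TFRecord file in step across reruns.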


In [0]:
from google.colab import files
#files.download('bal_laugh_speech_subset.tfrecord')
files.download('eval_laugh_speech_subset.tfrecord')

## Train set

In [7]:
!pip install tqdm
from tqdm import tqdm



#### Warning: The Audioset dataset is large and this cell takes a while to run. It took about 2 hours to process; the run recorded below finished in about 35 minutes.

In [8]:
%%time
labels = pd.read_csv('unbalanced_train_segments.csv',header=2, quotechar=r'"',skipinitialspace=True)
n_str = '|'.join(not_laugh_labels['label'].values)
labels['laughter'] = labels['positive_labels'].str.contains(l_str)
labels['not_laughter'] = (labels['positive_labels'].str.contains(n_str) & ~labels['positive_labels'].str.contains(l_str))

positive = labels[labels['laughter']==True]
negative = labels[labels['not_laughter']==True].sample(positive.shape[0])
subset = pd.concat([positive, negative])  # DataFrame.append was removed in pandas 2.0
subset.to_csv('laugh_speech_training_subset.csv')

print(subset.shape[0])

import glob
tfrecord_files = glob.glob('audioset_v1_embeddings/unbal_train/*')  # separate name, so google.colab's `files` is not shadowed
subset_ids = set(subset['# YTID'].values)  # a set keeps the membership test below fast

i = 0
writer = tf.python_io.TFRecordWriter('bal_laugh_speech_subset.tfrecord')
for tfrecord in tqdm(tfrecord_files):
    for example in tf.python_io.tf_record_iterator(tfrecord):
        tf_example = tf.train.Example.FromString(example)
        vid_id = tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding='UTF-8')
        if vid_id in subset_ids:
            writer.write(example)  # write the original serialized bytes unchanged
            i += 1
print(i)

writer.close()
print(writer)

W0720 14:37:56.061984 140353652848512 deprecation.py:323] From <timed exec>:20: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


18768


100%|██████████| 4096/4096 [35:08<00:00,  1.42it/s]

18768
<tensorflow.python.lib.io.tf_record.TFRecordWriter object at 0x7fa6568be6a0>
CPU times: user 35min 9s, sys: 9.98 s, total: 35min 19s
Wall time: 35min 15s





The following cells are only necessary if you want to download the subset files from Colab.

In [0]:
from google.colab import files
files.download('bal_laugh_speech_subset.tfrecord')

In [0]:
files.download('eval_laugh_speech_subset.tfrecord')