# Creating an Audioset subset

First, we need to download the precomputed [Audioset](https://research.google.com/audioset/download.html) embedding features (`features.tar.gz`). The archive is about 2.4 GiB (2.6 GB), so it may take some time to download.

### NOTE: The Audioset data is stored with case-sensitive filenames. On a filesystem with case-insensitive filenames (the default on macOS), 75% of the dataset will be overwritten when you decompress the archive. Run this only on a case-sensitive filesystem, such as a Linux machine.
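
One way to test a directory before extracting into it is to create a lower-case file and then look it up under an upper-case name. The following is a minimal sketch using only the standard library; the helper name `is_case_sensitive` is hypothetical and not part of the original notebook.

```python
import os
import tempfile


def is_case_sensitive(directory):
    """Return True if `directory` treats 'probe' and 'PROBE' as different files."""
    # Work inside a throwaway subdirectory so nothing is left behind.
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        lower = os.path.join(tmp, 'probe')
        upper = os.path.join(tmp, 'PROBE')
        with open(lower, 'w') as f:
            f.write('x')
        # On a case-insensitive filesystem the upper-case name resolves
        # to the file that was just created.
        return not os.path.exists(upper)
```

Calling `is_case_sensitive('.')` in the notebook's working directory should return `True` before it is safe to run the `tar` cell there.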

In [1]:
!wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/features/features.tar.gz

--2019-07-03 02:06:51--  http://storage.googleapis.com/us_audioset/youtube_corpus/v1/features/features.tar.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.20.128, 2607:f8b0:400e:c08::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.20.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2588881044 (2.4G) [application/octet-stream]
Saving to: ‘features.tar.gz’


2019-07-03 02:07:44 (52.5 MB/s) - ‘features.tar.gz’ saved [2588881044/2588881044]



In [2]:
!wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/unbalanced_train_segments.csv

--2019-07-03 02:09:31--  http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/unbalanced_train_segments.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.135.128, 2607:f8b0:400e:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.135.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101468408 (97M) [application/octet-stream]
Saving to: ‘unbalanced_train_segments.csv’


2019-07-03 02:09:37 (23.9 MB/s) - ‘unbalanced_train_segments.csv’ saved [101468408/101468408]



In [3]:
!tar xvzf features.tar.gz

audioset_v1_embeddings/unbal_train/HV.tfrecord
audioset_v1_embeddings/unbal_train/S0.tfrecord
audioset_v1_embeddings/unbal_train/bo.tfrecord
audioset_v1_embeddings/unbal_train/VL.tfrecord
audioset_v1_embeddings/unbal_train/kN.tfrecord
audioset_v1_embeddings/unbal_train/Wg.tfrecord
audioset_v1_embeddings/unbal_train/5j.tfrecord
audioset_v1_embeddings/unbal_train/nW.tfrecord
audioset_v1_embeddings/unbal_train/uo.tfrecord
audioset_v1_embeddings/unbal_train/fT.tfrecord
audioset_v1_embeddings/unbal_train/jZ.tfrecord
audioset_v1_embeddings/unbal_train/nU.tfrecord
audioset_v1_embeddings/unbal_train/kI.tfrecord
audioset_v1_embeddings/unbal_train/yG.tfrecord
audioset_v1_embeddings/unbal_train/BA.tfrecord
audioset_v1_embeddings/unbal_train/Lq.tfrecord
audioset_v1_embeddings/unbal_train/Mu.tfrecord
audioset_v1_embeddings/unbal_train/cU.tfrecord
audioset_v1_embeddings/unbal_train/f3.tfrecord
audioset_v1_embeddings/unbal_train/p-.tfrecord
audioset_v1_embeddings/unbal_train/dp.tfrecord
audioset_v1_e

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import pandas as pd
import glob


To run this on a Linux machine, I used this notebook in [Google Colaboratory](https://colab.research.google.com/).
The following cells upload the necessary files to the Colab instance.
If you are running this locally, you can skip this section.

In [6]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  

In [0]:
for k,v in uploaded.items():
  with open(k,'wb') as f:
    f.write(v)

In [8]:
uploaded.keys()

dict_keys([])

The laughter and non-laughter labels are loaded here from local .csv files.
To create a subset for a different category of labels, create new files listing the labels you want to select as your positive and negative classes. A list of all the Audioset labels is available [here](class_labels_indices.csv).
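
The label files are read with `names=['num','label','description']`, which suggests headerless three-column CSVs. The sketch below illustrates that assumed layout and how a joined pattern matches the comma-separated `positive_labels` values. It uses only the standard library; the label IDs are made-up placeholders, and the helper `build_pattern` is hypothetical. Unlike the notebook code, it passes each label through `re.escape`, a precaution in case a label ID ever contains a regex metacharacter.

```python
import csv
import io
import re

# Stand-in for a label file such as laugh_labels.csv.
# Assumed layout: index, label ID, description; no header row.
# The IDs below are placeholders, not real Audioset label IDs.
LABEL_FILE = io.StringIO(
    '0,/m/laugh1,Laughter\n'
    '1,/m/laugh2,"Giggle"\n'
)


def build_pattern(label_csv):
    """Join the label-ID column into a single regex alternation."""
    labels = [row[1] for row in csv.reader(label_csv)]
    return '|'.join(re.escape(label) for label in labels)


laugh_pattern = build_pattern(LABEL_FILE)

# Each segment lists its label IDs as one comma-separated string.
segments = ['/m/laugh1,/m/speech', '/m/speech', '/m/other,/m/laugh2']
is_laughter = [bool(re.search(laugh_pattern, s)) for s in segments]
print(is_laughter)  # [True, False, True]
```

pandas `str.contains` performs the same kind of regex search on each row, so a segment counts as positive if any one of the listed label IDs appears in it.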

In [0]:
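# Segment list: the column names are on the third line of the file (header=2)
labels = pd.read_csv('unbalanced_train_segments.csv',header=2, quotechar=r'"',skipinitialspace=True)
# Label files are read as headerless CSVs: index, label ID, description
laugh_labels = pd.read_csv('laugh_labels.csv',names=['num','label','description'])
not_laugh_labels = pd.read_csv('human_non_laugh_labels.csv',names=['num','label','description'])
# Regex alternation that matches any laughter label ID
l_str = '|'.join(laugh_labels['label'].values)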

In [0]:
n_str = '|'.join(not_laugh_labels['label'].values)
labels['not_laughter'] = (labels['positive_labels'].str.contains(n_str) & ~labels['positive_labels'].str.contains(l_str))

## Eval set

In [12]:
%%time
labels = pd.read_csv('eval_segments.csv',header=2, quotechar=r'"',skipinitialspace=True)
labels['laughter'] = labels['positive_labels'].str.contains(l_str)
labels['not_laughter'] = (labels['positive_labels'].str.contains(n_str) & ~labels['positive_labels'].str.contains(l_str))


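# Balance the subset: every laughter segment plus an equal-sized random sample of non-laughter segments
positive = labels[labels['laughter']==True]
negative = labels[labels['not_laughter']==True].sample(positive.shape[0])
subset = pd.concat([positive, negative])  # DataFrame.append was removed in pandas 2.0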
subset.to_csv('eval_laugh_speech_training_subset.csv')
print(subset.shape[0])


files = glob.glob('audioset_v1_embeddings/eval/*')
subset_ids = set(subset['# YTID'].values)  # a set makes each membership test in the loop constant-time

i=0
writer = tf.python_io.TFRecordWriter('eval_laugh_speech_subset.tfrecord')
for tfrecord in files:
    for example in tf.python_io.tf_record_iterator(tfrecord):
        tf_example = tf.train.Example.FromString(example)
        vid_id = tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding = 'UTF-8')
        if vid_id in subset_ids:
            writer.write(example)
            i+=1
print(i)

writer.close()

W0703 02:56:43.392501 139973038393216 deprecation.py:323] From <timed exec>:19: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


586
586
CPU times: user 20.1 s, sys: 1.11 s, total: 21.2 s
Wall time: 21.3 s
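
The deprecation warning above points to `tf.data.TFRecordDataset`, and the `tf.python_io` module used in this notebook belongs to the TensorFlow 1.x API. The following is a minimal sketch of the same filtering step written against the TensorFlow 2.x API with eager execution. It is not the code that produced the outputs in this notebook, and the function name `filter_tfrecords` is hypothetical.

```python
import tensorflow as tf


def filter_tfrecords(paths, wanted_ids, out_path):
    """Copy records whose 'video_id' is in wanted_ids; return how many were kept."""
    wanted = set(wanted_ids)  # constant-time membership tests
    kept = 0
    with tf.io.TFRecordWriter(out_path) as writer:
        # In eager mode, iterating the dataset yields one scalar string tensor per record.
        for raw in tf.data.TFRecordDataset(paths):
            serialized = raw.numpy()
            # Parsed the same way as in the cell above.
            example = tf.train.Example.FromString(serialized)
            vid_id = example.features.feature['video_id'].bytes_list.value[0].decode('utf-8')
            if vid_id in wanted:
                writer.write(serialized)
                kept += 1
    return kept
```

Under these assumptions, the eval subset could be written with `filter_tfrecords(glob.glob('audioset_v1_embeddings/eval/*'), subset['# YTID'], 'eval_laugh_speech_subset.tfrecord')`.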


## Train set

In [13]:
!pip install tqdm
from tqdm import tqdm



#### Warning: The Audioset dataset is large, and this cell takes a while to run. It took about 2 hours to process.

In [0]:
%%time
labels = pd.read_csv('unbalanced_train_segments.csv',header=2, quotechar=r'"',skipinitialspace=True)
n_str = '|'.join(not_laugh_labels['label'].values)
labels['laughter'] = labels['positive_labels'].str.contains(l_str)
labels['not_laughter'] = (labels['positive_labels'].str.contains(n_str) & ~labels['positive_labels'].str.contains(l_str))

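# Balance the subset: every laughter segment plus an equal-sized random sample of non-laughter segments
positive = labels[labels['laughter']==True]
negative = labels[labels['not_laughter']==True].sample(positive.shape[0])
subset = pd.concat([positive, negative])  # DataFrame.append was removed in pandas 2.0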
subset.to_csv('laugh_speech_training_subset.csv')

print(subset.shape[0])

import glob
files = glob.glob('audioset_v1_embeddings/unbal_train/*')
subset_ids = set(subset['# YTID'].values)  # a set makes each membership test in the loop constant-time

i=0
writer = tf.python_io.TFRecordWriter('bal_laugh_speech_subset.tfrecord')
for tfrecord in tqdm(files):
    for example in tf.python_io.tf_record_iterator(tfrecord):
        tf_example = tf.train.Example.FromString(example)
        vid_id = tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding = 'UTF-8')
        if vid_id in subset_ids:
            writer.write(example)
            i+=1
print(i)

writer.close()

  0%|          | 0/4096 [00:00<?, ?it/s]

18768


 32%|███▏      | 1297/4096 [11:56<24:56,  1.87it/s]

This is only necessary if you want to download the resulting files from Colab.

In [0]:
from google.colab import files
files.download('bal_laugh_speech_subset.tfrecord')

In [0]:
files.download('eval_laugh_speech_subset.tfrecord')