<a href="https://colab.research.google.com/github/satvik94/Embeddings-from-VGGish/blob/master/VGGish_Embeddings_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting Audio Embeddings through VGGish

This Colab notebook extracts audio embeddings from sound files using VGGish. Section 1 sets up and tests the VGGish code and model files. Section 2 extracts the embeddings of the sound files into a NumPy array and saves them to disk.

# 1. Importing and Testing the VGGish System

Based on the directions at: https://github.com/tensorflow/models/tree/master/research/audioset

In [19]:
!pip install numpy scipy
!pip install resampy tensorflow six
!pip install tf_slim
!pip install soundfile



In [20]:
!git clone https://github.com/tensorflow/models.git

fatal: destination path 'models' already exists and is not an empty directory.


In [21]:
# Check where we are in the kernel's file system.
!pwd

/content


In [22]:
# Grab the VGGish model
!curl -O https://storage.googleapis.com/audioset/vggish_model.ckpt
!curl -O https://storage.googleapis.com/audioset/vggish_pca_params.npz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  277M  100  277M    0     0   104M      0  0:00:02  0:00:02 --:--:--  104M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 73020  100 73020    0     0   432k      0 --:--:-- --:--:-- --:--:--  432k


In [23]:
# Make sure we got the model data.
!ls

embeddings.npy	  vggish_inference_demo.py  vggish_postprocess.pyc
mel_features.py   vggish_input.py	    vggish_slim.py
mel_features.pyc  vggish_input.pyc	    vggish_slim.pyc
models		  vggish_model.ckpt	    vggish_smoke_test.py
README.md	  vggish_params.py	    vggish_smoke_test.pyc
sample_data	  vggish_params.pyc	    vggish_train_demo.py
sounds		  vggish_pca_params.npz
sounds.zip	  vggish_postprocess.py


In [24]:
# Verify the location of the VGGish source files
!ls models/research/audioset/vggish

mel_features.py		  vggish_input.py	 vggish_slim.py
README.md		  vggish_params.py	 vggish_smoke_test.py
vggish_inference_demo.py  vggish_postprocess.py  vggish_train_demo.py


In [0]:
# Copy the source files to the current directory.
!cp models/research/audioset/vggish/* .

In [26]:
# Make sure the source files got copied correctly.
!ls

embeddings.npy	  vggish_inference_demo.py  vggish_postprocess.pyc
mel_features.py   vggish_input.py	    vggish_slim.py
mel_features.pyc  vggish_input.pyc	    vggish_slim.pyc
models		  vggish_model.ckpt	    vggish_smoke_test.py
README.md	  vggish_params.py	    vggish_smoke_test.pyc
sample_data	  vggish_params.pyc	    vggish_train_demo.py
sounds		  vggish_pca_params.npz
sounds.zip	  vggish_postprocess.py
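
The listing above can also be checked programmatically. The following is a minimal sketch, assuming the file names shown in the listing; `find_missing` is a hypothetical helper and is not part of VGGish.

```python
import os

# File names taken from the directory listing above.
REQUIRED_FILES = [
    "vggish_model.ckpt",
    "vggish_pca_params.npz",
    "vggish_input.py",
    "vggish_params.py",
    "vggish_slim.py",
    "mel_features.py",
]

def find_missing(filenames, root="."):
    """Return the names in `filenames` that are not regular files under `root`."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(root, name))]

missing = find_missing(REQUIRED_FILES)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```

If anything is reported missing, rerunning the download and copy cells above should restore it.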


In [0]:
# Run the test, which also loads all the necessary functions.
from vggish_smoke_test import *

In [0]:
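import numpy as np
import soundfile as sf
import tensorflow.compat.v1 as tf  # TF1-style graph/session API

import vggish_input
import vggish_params
import vggish_slim

def CreateVGGishNetwork(hop_size=0.96):   # Hop size is in seconds.
  """Define the VGGish model, load the checkpoint, and return a dictionary
  that points to the tensors defined by the model.

  Relies on the global TensorFlow session `sess`, which must exist before
  this function is called.
  """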
  vggish_slim.define_vggish_slim()
  checkpoint_path = 'vggish_model.ckpt'
  vggish_params.EXAMPLE_HOP_SECONDS = hop_size
  
  vggish_slim.load_vggish_slim_checkpoint(sess, checkpoint_path)

  features_tensor = sess.graph.get_tensor_by_name(
      vggish_params.INPUT_TENSOR_NAME)
  embedding_tensor = sess.graph.get_tensor_by_name(
      vggish_params.OUTPUT_TENSOR_NAME)

  layers = {'conv1': 'vggish/conv1/Relu',
            'pool1': 'vggish/pool1/MaxPool',
            'conv2': 'vggish/conv2/Relu',
            'pool2': 'vggish/pool2/MaxPool',
            'conv3': 'vggish/conv3/conv3_2/Relu',
            'pool3': 'vggish/pool3/MaxPool',
            'conv4': 'vggish/conv4/conv4_2/Relu',
            'pool4': 'vggish/pool4/MaxPool',
            'fc1': 'vggish/fc1/fc1_2/Relu',
            'fc2': 'vggish/fc2/Relu',
            'embedding': 'vggish/embedding',
            'features': 'vggish/input_features',
         }
  g = tf.get_default_graph()
  for k in layers:
    layers[k] = g.get_tensor_by_name( layers[k] + ':0')
    
  return {'features': features_tensor,
          'embedding': embedding_tensor,
          'layers': layers,
         }

In [0]:
def EmbeddingsFromVGGish(vgg, x, sr):
  '''Run the VGGish model, starting with a sound (x) at sample rate
  (sr). Return a dictionary of embeddings from the different layers
  of the model.'''
  # Produce a batch of log mel spectrogram examples.
  input_batch = vggish_input.waveform_to_examples(x, sr)
  # print('Log Mel Spectrogram example: ', input_batch[0])

  layer_names = vgg['layers'].keys()
  tensors = [vgg['layers'][k] for k in layer_names]
  
  results = sess.run(tensors,
                     feed_dict={vgg['features']: input_batch})

  resdict = {}
  for i, k in enumerate(layer_names):
    resdict[k] = results[i]
    
  return resdict
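
To see what `EmbeddingsFromVGGish` returns, it can help to print the shape of each array in the result dictionary. The following is a minimal sketch: `summarize_layers` is a hypothetical helper, and the arrays are zero-filled placeholders rather than real model output. The shapes assume four examples, a 96 x 64 log mel patch per example, and the 128-dimensional embedding.

```python
import numpy as np

def summarize_layers(resdict):
    """Map each layer name to the shape of its activation array."""
    return {name: np.asarray(values).shape for name, values in resdict.items()}

# Placeholder arrays standing in for real activations (assumed shapes).
fake_resdict = {
    "features": np.zeros((4, 96, 64)),
    "embedding": np.zeros((4, 128)),
}

for name, shape in summarize_layers(fake_resdict).items():
    print("{:>10s}: {}".format(name, shape))
```

With a real network, the same helper could be applied to the dictionary returned by `EmbeddingsFromVGGish(vgg, x, sr)`.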

# 2. Extracting Audio Embeddings from VGGish

In [0]:
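# Create the network.
import tensorflow.compat.v1 as tf  # TF1-style graph/session API
tf.disable_eager_execution()  # Graph mode is required under TensorFlow 2.x.
tf.reset_default_graph()
sess = tf.Session()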
# The argument `t` is the hop, in seconds, between the start times of
# consecutive VGGish examples. Each example always covers
# vggish_params.EXAMPLE_WINDOW_SECONDS (0.96 s) of audio, so with `t` = 2.5
# one 0.96 s window is taken every 2.5 s and the audio in between is skipped.
# Each window yields one embedding. Trailing audio that does not fill a
# complete window is dropped.
vgg = CreateVGGishNetwork(2.5) # `t` = 2.5
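
The effect of the hop on the number of embeddings per file can be estimated with simple arithmetic. The sketch below assumes a 0.96 s example window and 100 spectrogram frames per second (a 10 ms STFT hop), and it ignores the few frames lost at the clip edges, so the result is approximate. `estimate_num_examples` is a hypothetical helper, not a VGGish function.

```python
import math

def estimate_num_examples(duration_s, hop_s, window_s=0.96, frame_rate_hz=100.0):
    """Estimate how many VGGish examples (and embeddings) a clip yields.

    Assumes spectrogram frames arrive at `frame_rate_hz` and ignores the
    small loss of frames at the clip edges, so treat the result as approximate.
    """
    num_frames = int(math.floor(duration_s * frame_rate_hz))
    window_frames = int(round(window_s * frame_rate_hz))
    hop_frames = int(round(hop_s * frame_rate_hz))
    if num_frames < window_frames:
        return 0  # Clip is shorter than one window.
    return 1 + (num_frames - window_frames) // hop_frames

for hop in (0.96, 2.5):
    print("hop = {} s -> about {} examples for a 10 s clip".format(
        hop, estimate_num_examples(10.0, hop)))
```

A larger hop produces fewer embeddings per file but leaves gaps between the analysed windows. A hop equal to the window length covers the audio without gaps or overlap.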


In [31]:
"""
Import data into Colab.
Upload the WAV files as a single zip archive named 'sounds.zip' in the
current working directory.
"""
from zipfile import ZipFile
zip_name = "sounds.zip"

# Use `zf` rather than `zip` to avoid shadowing the built-in zip().
with ZipFile(zip_name, 'r') as zf:
  zf.extractall('sounds')
  print("Extracted all sound files into the folder named 'sounds'!!")

Extracted all sound files into the folder named 'sounds'!!
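
Before extracting, the archive's contents can be inspected to confirm that it holds the expected WAV files. The following is a minimal sketch; `list_wav_members` is a hypothetical helper, and the demo builds a small in-memory archive so that it runs without 'sounds.zip'.

```python
import io
from zipfile import ZipFile

def list_wav_members(zip_source):
    """Return the sorted names of the .wav entries in a zip archive."""
    with ZipFile(zip_source, "r") as zf:
        return sorted(name for name in zf.namelist()
                      if name.lower().endswith(".wav"))

# Demo on an in-memory archive with empty placeholder entries.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("a.wav", b"")
    zf.writestr("notes.txt", b"")
    zf.writestr("nested/b.WAV", b"")
buffer.seek(0)
print(list_wav_members(buffer))
```

Entries stored in a subfolder of the archive, such as `nested/b.WAV` in the demo, end up in a subfolder of 'sounds' after extraction, so a non-recursive pattern like `sounds/*.wav` will not find them.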


In [32]:
"""
Load all the files into a list.
"""
import glob
sounds = glob.glob('sounds/*.wav')
print("The contents of the list are: {}".format(sounds))

The contents of the list are: ['sounds/amal.wav', 'sounds/acomic.wav']


In [36]:
"""
This code cell extracts embeddings from every sound file in the list.
The embeddings are stacked row-wise in the NumPy array `em`, one row per
VGGish example, in the same order as the files in `sounds`.
"""

embeddings = []
for s in sounds:
  print("Extracting embeddings from " + s)
  in_signal, in_sr = sf.read(s)
  resdict = EmbeddingsFromVGGish(vgg, in_signal, in_sr)
  embeddings.append(resdict['embedding'])

# Concatenate the per-file embeddings into a single array.
em = np.concatenate(embeddings, axis=0)

Extracting embeddings from sounds/amal.wav
Extracting embeddings from sounds/acomic.wav
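
Because the embeddings of all files end up in one array, the array alone does not record which rows came from which file. One way to keep track is to store each file's row range alongside the stacked matrix. The following is a minimal sketch, assuming a non-empty list of per-file arrays; `stack_with_index` is a hypothetical helper, and the demo arrays are placeholders rather than real VGGish output.

```python
import numpy as np

def stack_with_index(per_file_embeddings):
    """Concatenate per-file embedding arrays and record each file's row range.

    per_file_embeddings: non-empty list of (name, array) pairs, where each
    array has shape (n_i, d).
    Returns (matrix, index), where index maps name -> (start_row, end_row)
    and end_row is exclusive.
    """
    index = {}
    start = 0
    for name, emb in per_file_embeddings:
        index[name] = (start, start + emb.shape[0])
        start += emb.shape[0]
    matrix = np.concatenate([emb for _, emb in per_file_embeddings], axis=0)
    return matrix, index

# Placeholder arrays, not real VGGish output.
demo = [("a.wav", np.ones((3, 128))), ("b.wav", np.zeros((2, 128)))]
matrix, index = stack_with_index(demo)
print(matrix.shape, index)
```

The rows for one file can then be recovered with `matrix[start:end]`, using the range stored under that file's name.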


In [0]:
"""
Save the audio embeddings to 'embeddings.npy'. The file can be downloaded from Colab for later use.
"""
np.save("embeddings.npy", em)


In [38]:
"""
Reload the saved array and check that it matches `em`.
"""
em_load = np.load("embeddings.npy")
print("em_load shape is {}".format(em_load.shape))

print("Has the storing and loading worked correctly? {}".format(np.array_equal(em, em_load)))

em_load shape is (30, 128)
Has the storing and loading worked correctly? True
