# Data Generator Speed/Benchmarking
This notebook defines our DataGenerator class, constructs a DataGenerator object, and then measures how long it takes to generate batches of data. Along the way it reports timings for individual batches as well as the average time over a small sample of batches. If you have not yet set up the DataGenerator, you may want to work through that notebook first (generator_test.ipynb), or you can simply read along with this one.

Use the following commands to install the two libraries, librosa and ffmpeg. librosa lets us easily read and process audio data. ffmpeg is not used at this time but may be needed later.

In [None]:
!pip install librosa




In [None]:
!pip install ffmpeg


Collecting ffmpeg
  Downloading https://files.pythonhosted.org/packages/f0/cc/3b7408b8ecf7c1d20ad480c3eaed7619857bf1054b690226e906fdf14258/ffmpeg-1.4.tar.gz
Building wheels for collected packages: ffmpeg
  Building wheel for ffmpeg (setup.py) ... done
  Created wheel for ffmpeg: filename=ffmpeg-1.4-cp36-none-any.whl size=6083 sha256=926ca2dd1f5bc687a09da0e630045add667bc9e5bf85a5fe2c49d9014a1a6a72
  Stored in directory: /root/.cache/pip/wheels/b6/68/c3/a05a35f647ba871e5572b9bbfc0b95fd1c6637a2219f959e7a
Successfully built ffmpeg
Installing collected packages: ffmpeg
Successfully installed ffmpeg-1.4


Verify the directory structure and create a directory for the audio data.

In [None]:
ls

[0m[01;34msample_data[0m/


In [None]:
mkdir grace_data

In [None]:
ls

In [None]:
cd grace_data/

/content/grace_data


Use the following two commands to download the Amazing Grace data from the URL below and unpack it into the current directory. This creates a subdirectory under grace_data called amazing_grace.

In [None]:
!wget https://ccrma.stanford.edu/damp/performances/amazing_grace/amazing_grace.tar.gz

--2020-10-16 07:11:33--  https://ccrma.stanford.edu/damp/performances/amazing_grace/amazing_grace.tar.gz
Resolving ccrma.stanford.edu (ccrma.stanford.edu)... 171.64.197.141
Connecting to ccrma.stanford.edu (ccrma.stanford.edu)|171.64.197.141|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18757937250 (17G) [application/x-gzip]
Saving to: ‘amazing_grace.tar.gz’


2020-10-16 07:21:25 (30.2 MB/s) - ‘amazing_grace.tar.gz’ saved [18757937250/18757937250]



In [None]:
!tar xvzf amazing_grace.tar.gz;

Verify the directory structure and the presence of the data.

In [None]:
%ls

[0m[01;34mamazing_grace[0m/     amazing_grace.midi    amazing_grace.tsv
amazing_grace.m4a  amazing_grace.tar.gz


In [None]:
cd amazing_grace

/content/grace_data/amazing_grace


In [None]:
%ls

Import the remaining libraries we need. If you are running this code locally, we assume these popular libraries are already installed. If they are not, install them with pip.

In [None]:
import numpy as np
import keras

import random
import librosa
from os import listdir
from os.path import isfile, join

Create a list of the files in the data directory (we are currently inside amazing_grace) and verify that the list contains the correct content.

In [None]:
file_list = [f for f in listdir(".") if isfile(join(".", f))]
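The verification step mentioned above is not shown in a cell, so here is a minimal sketch of one way to do it. The helper names `list_audio_files` and `summarize_extensions` are hypothetical and are not part of the original notebook. The `.m4a` default is an assumption based on the file names printed later in this notebook.

```python
from collections import Counter
from os import listdir
from os.path import isfile, join, splitext


def list_audio_files(directory, extension=".m4a"):
    """Return the sorted names of regular files in `directory` with `extension`.

    The extension check is case-insensitive. Subdirectories are skipped.
    """
    return sorted(
        f for f in listdir(directory)
        if isfile(join(directory, f)) and f.lower().endswith(extension)
    )


def summarize_extensions(directory):
    """Count the regular files in `directory` by lower-cased extension."""
    return Counter(
        splitext(f)[1].lower()
        for f in listdir(directory)
        if isfile(join(directory, f))
    )
```

From inside amazing_grace, calling `summarize_extensions(".")` gives a quick view of what the directory holds, and `list_audio_files(".")` could stand in for `file_list` if any non-audio files turn up.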

In [None]:
cd ..

/content/grace_data


In [None]:
cd ..

/content


Below is the script for the Data Generator. It takes parameters describing the data and generates a batch of data each time it is called. Our batch size is set to one because the audio clips differ in length, so several clips cannot be stored in one array without first reshaping them. In the future we may look into reshaping methods.

We are not generating any labels right now, as you can see from the commented-out 'y' variable and the corresponding code. Uncommenting that code would return labels along with the training data.
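As one possible direction for the reshaping mentioned above, here is a minimal sketch that zero-pads clips of different lengths so they fit in one array. The `pad_batch` helper is hypothetical and is not used by the DataGenerator in this notebook. Zero-padding is only one option; truncating or cropping to a fixed length would be alternatives.

```python
import numpy as np


def pad_batch(clips, pad_value=0.0):
    """Stack 1-D arrays of different lengths into one 2-D float32 array.

    Shorter clips are right-padded with `pad_value`. Returns the padded
    batch together with the original lengths, so the padding can be
    masked or stripped later.
    """
    if len(clips) == 0:
        raise ValueError("pad_batch needs at least one clip")
    lengths = np.array([len(c) for c in clips], dtype=int)
    batch = np.full((len(clips), lengths.max()), pad_value, dtype=np.float32)
    for i, clip in enumerate(clips):
        batch[i, :len(clip)] = clip
    return batch, lengths
```

With a helper like this, a batch size larger than one would become possible, at the cost of memory proportional to the longest clip in each batch.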

In [None]:
import numpy as np
import keras

import random
import librosa
from os import listdir
from os.path import isfile, join

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, path_prefix, list_IDs, labels, batch_size, dim, n_channels,
                 n_classes, shuffle):
        'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.prefix = path_prefix
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        X = self.__data_generation(list_IDs_temp) #,y when other stuff is uncommented

        return X#, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples' # X : (batch_size, curr_len)
        # Initialization
        # Load each file once up front to find the array length. With
        # batch_size = 1 this is the length of the single clip in the batch.
        # Note that every file is therefore decoded twice per batch (here and
        # in the loop below), and both loads are included in the timings.
        curr_len = 0
        for i, ID in enumerate(list_IDs_temp):
            print(ID)
            # Use the configured prefix instead of a hard-coded path
            curr_len = len(self.load_audio(self.prefix + "/" + ID))
        print(curr_len)
        X = np.empty((self.batch_size, curr_len))
        #y = np.empty((self.batch_size), dtype=int)

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = self.load_audio(self.prefix + "/" + ID)

            # Store class
            #y[i] = self.labels[ID]

        return X#, keras.utils.to_categorical(y, num_classes=self.n_classes)

    def load_audio(self,audio_file_path):
        """Load audio to numpy array and return it
        """
        x,sr = librosa.load(audio_file_path, sr = None)
        return x

Construct a data generator with parameters matching our data. Every file gets the same placeholder label ("grace") for now, since we are not performing any classification yet.

In [None]:
dg = DataGenerator(
    path_prefix="grace_data/amazing_grace",
    list_IDs=file_list,
    labels=["grace"] * len(file_list),
    batch_size=1,
    dim=(1, len(file_list)),
    n_channels=1,
    n_classes=1,
    shuffle=True,
)

# Benchmark Speed
We will now measure how long it takes to generate batches with our Data Generator by calling its `__getitem__` method on several batch indices and timing each call with Python's time module.
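The cells in this section time each call by hand with `time.time()`. As an optional alternative, here is a minimal sketch of a reusable timing helper. The names `time_batches` and `summarize` are hypothetical and do not appear in the original notebook. The sketch uses `time.perf_counter()`, a monotonic clock intended for measuring short durations.

```python
import statistics
import time


def time_batches(get_batch, indices):
    """Call get_batch(i) for each index and return the durations in seconds."""
    durations = []
    for i in indices:
        start = time.perf_counter()
        get_batch(i)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return simple summary statistics for a list of durations."""
    return {
        "n": len(durations),
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }
```

With the generator constructed above, this could be used as `summarize(time_batches(dg.__getitem__, range(30, 51)))`. Reporting the median alongside the mean helps when a few unusually long clips dominate the average.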

Import Python's time module and check that time.time() returns the current time in seconds.

In [None]:
import time

In [None]:
time.time()

1602835102.1515386

In [None]:
time.time()

1602835108.6770604

Now let's measure the time it takes to generate one single batch.

In [None]:
start_time = time.time()
dg.__getitem__(0)
end_time = time.time()
print("Process took %.2f seconds" % (end_time-start_time))

263008531_68996947.m4a
8395840
Process took 1.80 seconds


In [None]:
start_time = time.time()
dg.__getitem__(1030)
end_time = time.time()
print("Process took %.2f seconds" % (end_time-start_time))

286551583_103860735.m4a
4215168
Process took 1.27 seconds


Now let us loop through multiple batches and measure the time for each. We also calculate the average time it takes to generate a single batch of data from this small sample.

In [None]:
total_time = 0
total_ct = 0
for i in range(30,51):
  start_time = time.time()
  dg.__getitem__(i)
  end_time = time.time()
  print("Process took %.2f seconds" % (end_time-start_time))
  total_time += (end_time-start_time)
  total_ct +=1
print("Average time to generate a single batch: %.2f" % (total_time/total_ct))

524500461_243835078.m4a
4215424
Process took 1.10 seconds
172067295_37545021.m4a
4214912
Process took 1.05 seconds
544655410_243853092.m4a
9152192
Process took 1.63 seconds
459985479_222358934.m4a
2732928
Process took 1.05 seconds
443509753_173808619.m4a
9163456
Process took 1.54 seconds
312874508_120588370.m4a
3397376
Process took 1.03 seconds
372746313_126439607.m4a
4172672
Process took 1.09 seconds
67640440_62343015.m4a
4164992
Process took 1.12 seconds
307388511_91366358.m4a
4214912
Process took 1.11 seconds
269977099_78149466.m4a
4214656
Process took 1.11 seconds
54845271_125479988.m4a
4215424
Process took 1.10 seconds
295305119_87547844.m4a
4214912
Process took 1.08 seconds
65864893_3244916.m4a
4214912
Process took 1.07 seconds
313056930_93384454.m4a
4214656
Process took 1.02 seconds
484314656_195187742.m4a
4214528
Process took 1.05 seconds
254222358_70414312.m4a
4215424
Process took 1.04 seconds
242145147_138948490.m4a
1951488
Process took 0.70 seconds
133273365_29661618.m4a
421