# Feeding
Training a machine learning model usually requires a dataset object that gives access to the individual training samples. Typically, each sample consists of input features and targets. In audiomate this data is stored in a Container (for targets) or a FeatureContainer (for input features). The feeding module wraps such containers in dataset-like objects, as expected by many machine learning and deep learning frameworks.
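To make "dataset-like" concrete, here is a minimal sketch of such an object, assuming only that a framework needs a length and index access. The class name `PairedFrameDataset` and the toy data are hypothetical and not part of audiomate.

```python
class PairedFrameDataset:
    """Hypothetical dataset-like wrapper (illustration only, not audiomate code)."""

    def __init__(self, inputs, targets):
        if len(inputs) != len(targets):
            raise ValueError('inputs and targets must have the same length')
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        # Number of available samples.
        return len(self.inputs)

    def __getitem__(self, index):
        # One sample is the input/target pair stored at the same index.
        return [self.inputs[index], self.targets[index]]


toy_dataset = PairedFrameDataset(
    inputs=[[0.1, 0.2], [0.3, 0.4]],
    targets=[[1.0, 0.0], [0.0, 1.0]],
)

print(len(toy_dataset))
print(toy_dataset[1])
```

The datasets in the feeding module follow the same access pattern, but read their data from containers on disk instead of in-memory lists.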

In [1]:
import os
import numpy as np

import audiomate
from audiomate import corpus
from audiomate.corpus import assets
from audiomate.utils import units

In [2]:
urbansound8k_subset = audiomate.Corpus.load('data/urbansound_subset', reader='urbansound8k')

## Extract features/targets
First we need to extract the features and targets.

In [3]:
frame_settings = units.FrameSettings(2048, 1024)
sampling_rate = 16000
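As a rough illustration of what these settings mean, the sketch below converts the frame and hop size from samples to seconds and counts the frames of a signal. The helper names are hypothetical, and the frame count assumes that a trailing partial frame is zero-padded, which may differ from how audiomate handles the end of a signal.

```python
import math


def duration_seconds(num_samples, sampling_rate):
    """Convert a length in samples to seconds."""
    return num_samples / sampling_rate


def num_frames(num_samples, frame_size, hop_size):
    """Count frames, assuming the last partial frame is zero-padded (an assumption)."""
    if num_samples <= frame_size:
        return 1
    return 1 + math.ceil((num_samples - frame_size) / hop_size)


frame_seconds = duration_seconds(2048, 16000)  # length of one frame
hop_seconds = duration_seconds(1024, 16000)    # shift between consecutive frames

print(frame_seconds)
print(hop_seconds)
print(num_frames(64000, frame_size=2048, hop_size=1024))
```

With a hop size of half the frame size, consecutive frames overlap by 50 percent.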

**Features**

In [4]:
from audiomate.processing import pipeline

feature_path = 'output/mel_features.hdf5'
os.makedirs('output', exist_ok=True)

mel_extractor = pipeline.MelSpectrogram(n_mels=23)
power_to_db = pipeline.PowerToDb(ref=np.max, parent=mel_extractor)

features = power_to_db.process_corpus(urbansound8k_subset, 
                                feature_path, 
                                frame_size=frame_settings.frame_size, 
                                hop_size=frame_settings.hop_size, 
                                sr=sampling_rate)

**Targets**

In [5]:
from audiomate import encoding

target_path = 'output/targets.hdf5'
os.makedirs('output', exist_ok=True)

labels = list(urbansound8k_subset.all_label_values(corpus.LL_SOUND_CLASS))
encoder = encoding.FrameHotEncoder(labels, corpus.LL_SOUND_CLASS, frame_settings, sampling_rate)
encodings = encoder.encode_corpus(urbansound8k_subset, target_path)
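The idea behind frame-wise hot encoding can be sketched as follows: every frame gets a vector with one entry per label, and an entry is set to 1 if a segment with that label overlaps the frame. This is a simplified illustration under that assumption, not the implementation of `FrameHotEncoder`. The function name, the segment format and the label names are hypothetical.

```python
def encode_frames(segments, labels, num_frames, frame_size, hop_size, sampling_rate):
    """Return one hot vector per frame.

    segments: list of (start_seconds, end_seconds, label) tuples (hypothetical format).
    """
    encoded = [[0.0] * len(labels) for _ in range(num_frames)]

    for frame_index in range(num_frames):
        # Time span covered by this frame.
        frame_start = frame_index * hop_size / sampling_rate
        frame_end = (frame_index * hop_size + frame_size) / sampling_rate

        for start, end, label in segments:
            # Mark the label if the segment overlaps the frame.
            if start < frame_end and end > frame_start:
                encoded[frame_index][labels.index(label)] = 1.0

    return encoded


toy_labels = ['class_a', 'class_b']
toy_segments = [(0.0, 0.1, 'class_a'), (0.15, 0.3, 'class_b')]

toy_encoding = encode_frames(toy_segments, toy_labels, num_frames=4,
                             frame_size=2048, hop_size=1024, sampling_rate=16000)

for frame_vector in toy_encoding:
    print(frame_vector)
```

Because frames overlap and segments can meet inside a frame, a single frame may carry more than one active label.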

## Feeding single frames
In this scenario a single training sample consists of one frame.

In [6]:
from audiomate import feeding

input_container = assets.FeatureContainer('output/mel_features.hdf5')
target_container = assets.Container('output/targets.hdf5')

input_container.open()
target_container.open()

single_frame_dataset = feeding.FrameDataset(urbansound8k_subset, [input_container, target_container])

# Get a single sample, which is a list with the input and the target of the frame at index 3
single_frame_dataset[3]

[array([ -7.2746153,  -0.8263799,  -3.2943037,  -5.7100906,  -8.060489 ,
         -7.1336184, -11.988809 , -15.724296 , -12.836285 , -15.029507 ,
        -14.673418 , -15.978542 , -13.3668995, -12.036485 , -13.558588 ,
        -14.553731 , -14.105342 , -18.139305 , -16.73909  , -16.591326 ,
        -13.479118 , -15.261944 , -17.350136 ], dtype=float32),
 array([1., 0.], dtype=float32)]

Since index access to single frames of the underlying HDF5 files (via h5py) is quite slow, we can use a PartitioningIterator, which loads partitions of a given maximum size into memory. Once all frames of a partition have been iterated over, the next partition is loaded.

In [7]:
# Create via dataset
single_frame_iterator = single_frame_dataset.partitioned_iterator('250M', shuffle=True, seed=34)

# Or create it directly
single_frame_iterator = feeding.FrameIterator(urbansound8k_subset, [input_container, target_container], 
                                              '200M', shuffle=True, seed=23)

next(single_frame_iterator)

[array([-50.76419 , -45.63187 , -33.773666, -31.817165, -40.481903,
        -47.343697, -47.572548, -46.96832 , -50.65575 , -46.23703 ,
        -48.19282 , -49.336872, -48.22521 , -46.70622 , -50.95082 ,
        -59.49267 , -60.997467, -65.7069  , -64.58801 , -64.83406 ,
        -68.44708 , -68.750336, -69.54637 ], dtype=float32),
 array([0., 1.], dtype=float32)]
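The partitioning idea can be sketched in plain Python: collect frames until a partition is full, shuffle within the partition, yield its frames, then continue with the next partition. This is only an illustration of the concept, not audiomate's implementation. The function name is hypothetical, and the partition size is given as a number of frames instead of a memory size such as `'250M'`.

```python
import random


def partitioned_frames(utterances, max_frames_per_partition, shuffle=True, seed=None):
    """Yield (utterance id, frame index, frame) tuples, one partition at a time.

    utterances: dict mapping an utterance id to its list of frames (hypothetical format).
    """
    rng = random.Random(seed)
    utt_ids = sorted(utterances)
    if shuffle:
        rng.shuffle(utt_ids)

    partition = []
    for utt_id in utt_ids:
        for frame_index, frame in enumerate(utterances[utt_id]):
            partition.append((utt_id, frame_index, frame))

            # Partition is full: hand out its frames before loading more.
            if len(partition) == max_frames_per_partition:
                if shuffle:
                    rng.shuffle(partition)
                yield from partition
                partition = []

    # Remaining frames form the last, possibly smaller, partition.
    if partition:
        if shuffle:
            rng.shuffle(partition)
        yield from partition


toy_utterances = {'utt-a': [1, 2, 3], 'utt-b': [4, 5], 'utt-c': [6, 7, 8, 9]}
toy_frames = list(partitioned_frames(toy_utterances, 4, shuffle=True, seed=23))
print(len(toy_frames))
```

Shuffling happens only within a partition, so the order is not a uniform shuffle of all frames. In exchange, only one partition has to be held in memory at a time.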

## Feeding multiple frames
In some cases a single training sample should be a sequence of frames. In this case the MultiFrameDataset can be used, which returns an array of frames.

In [8]:
from audiomate import feeding

input_container = assets.FeatureContainer('output/mel_features.hdf5')
target_container = assets.Container('output/targets.hdf5')

input_container.open()
target_container.open()

multi_frame_dataset = feeding.MultiFrameDataset(urbansound8k_subset, [input_container, target_container], 
                                               frames_per_chunk=4, return_length=True, pad=True)

# Get a single sample, which is a tuple/list with input/target chunk with index 3
sample = multi_frame_dataset[3]

print('Input feature shape of sample: {}'.format(str(sample[0].shape)))
print('Target shape of sample: {}'.format(str(sample[1].shape)))
print('Number of frames in sample: {}'.format(sample[2]))

Input feature shape of sample: (4, 23)
Target shape of sample: (4, 2)
Number of frames in sample: 4


As with the FrameDataset, a partitioned iterator can be used in the same way for the multi-frame case.
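How chunking with padding and a returned length can work is sketched below. It assumes that chunks do not overlap and that a short final chunk is filled with zero frames. The function name is hypothetical, and audiomate's own chunking may differ in detail.

```python
def get_chunk(frames, chunk_index, frames_per_chunk, pad=True):
    """Return (chunk, length) for the chunk at the given index.

    length is the number of real (non-padded) frames in the chunk.
    """
    start = chunk_index * frames_per_chunk
    chunk = [list(frame) for frame in frames[start:start + frames_per_chunk]]
    length = len(chunk)

    if length == 0:
        raise IndexError('chunk index out of range')

    if pad and length < frames_per_chunk:
        # Fill the chunk up with zero frames of the same dimension.
        frame_dim = len(frames[0])
        chunk.extend([0.0] * frame_dim for _ in range(frames_per_chunk - length))

    return chunk, length


# Six toy frames with two values each.
toy_utterance = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0], [6.0, 6.0]]

last_chunk, last_length = get_chunk(toy_utterance, chunk_index=1, frames_per_chunk=4)
print(last_chunk)
print(last_length)
```

Returning the length alongside the padded chunk lets a model or loss function ignore the padded frames.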

## Feeding full utterances
In speech recognition, for example, the training samples are most likely full utterances. Furthermore, the dimensions of inputs and targets can differ. With the UtteranceDataset we can load full utterances as samples.

In [9]:
from audiomate import feeding

input_container = assets.FeatureContainer('output/mel_features.hdf5')
target_container = assets.Container('output/targets.hdf5')

input_container.open()
target_container.open()

utt_dataset = feeding.UtteranceDataset(urbansound8k_subset, [input_container, target_container], pad=True)

# Get a sample (input-feats-utt-1, len-input-feats-utt-1, targets-utt-1, len-target-utt-1)
sample = utt_dataset[1]
print('Input feature shape: {}'.format(str(sample[0].shape)))
print('Length of the input features: {}'.format(sample[1]))
print('Targets shape: {}'.format(str(sample[2].shape)))
print('Length of the targets: {}'.format(sample[3]))


Input feature shape: (62, 23)
Length of the input features: 62
Targets shape: (62, 2)
Length of the targets: 62
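Utterances usually differ in length, so one common way to handle them is to pad them to a shared length and keep the original lengths. The sketch below illustrates that idea with plain lists. It assumes zero-padding to the longest utterance, which is not confirmed to be what `pad=True` does in audiomate, and the function name is hypothetical.

```python
def pad_utterances(utterances):
    """Pad all utterances with zero frames to the length of the longest one.

    Returns the padded utterances and the original lengths.
    """
    max_length = max(len(utterance) for utterance in utterances)
    frame_dim = len(utterances[0][0])

    padded = []
    lengths = []
    for utterance in utterances:
        lengths.append(len(utterance))
        padding = [[0.0] * frame_dim for _ in range(max_length - len(utterance))]
        padded.append([list(frame) for frame in utterance] + padding)

    return padded, lengths


toy_batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # three frames
    [[7.0, 8.0]],                          # one frame
]

padded_batch, original_lengths = pad_utterances(toy_batch)
print(padded_batch[1])
print(original_lengths)
```

Keeping the lengths next to the padded data mirrors the sample layout shown above, where every array is followed by its length.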
