# Introduction

The training data for the *Rainforest Connection Species Audio Detection* competition is provided in two formats. The first is a directory of `.flac` compressed audio files plus two `.csv` files containing the annotations for those files. The audio files can be loaded, decoded, clipped, and annotated using lookups into the `.csv` files. The second is a single directory of `.tfrec` files, where each `.tfrec` file contains many data records, each consisting of the audio signal and its annotation serialized into a [protocol buffer](https://developers.google.com/protocol-buffers/). The `.tfrec` files can be loaded directly into a `TFRecordDataset` with the annotations already attached. While the first format may seem more straightforward to users unfamiliar with `TFRecordDataset`, the second reduces preprocessing overhead because the annotations travel with the audio, and it plugs directly into a `tf.data` input pipeline. This notebook presents a data wrangling method that takes advantage of the `TFRecordDataset` format.

In [None]:
import os
import tensorflow as tf

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "last_expr_or_assign"

from IPython.display import Markdown as md

# Mapping the Data Source

First we must map the `.tfrec` datafiles to a `TFRecordDataset`.

In [None]:
# Create TFRecordDataset from training files

TFREC_TRAIN_PATH = '/kaggle/input/rfcx-species-audio-detection/tfrecords/train'

# Sort the listing so the record order is reproducible (os.listdir order is arbitrary)
datafiles = sorted(os.listdir(TFREC_TRAIN_PATH))
raw_dataset = tf.data.TFRecordDataset([os.path.join(TFREC_TRAIN_PATH, x) for x in datafiles])

The `raw_dataset` created from the `.tfrec` datafiles contains a single scalar `tf.Tensor` of type `tf.string` for each audio recording in the training dataset. Each of these strings is a [serialized representation of the data record](https://developers.google.com/protocol-buffers/) that contains the `recording_id`, the `audio_wav`, and the `label_info` (as specified in the competition [data description](https://www.kaggle.com/c/rfcx-species-audio-detection/data)).

# Extract the Features
We must now extract the features from the serialized string into three distinct `tf.Tensor`s using a dataset mapping.

In [None]:
# Add the feature labels to the raw dataset examples

feature_description = {
    'recording_id': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'audio_wav': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'label_info': tf.io.FixedLenFeature([],tf.string,default_value=''),
}

def _label_features(example_proto):
    # Parse using the above dictionary
    return tf.io.parse_single_example(example_proto, feature_description)

parsed_dataset = raw_dataset.map(_label_features)

With the `parsed_dataset` we can now access the individual features in a record.

In [None]:
next_example = next(iter(parsed_dataset))
print(next_example['label_info'])
print(next_example['recording_id'])
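
To see what the parsing step does in isolation, the sketch below round-trips a single synthetic record without touching the competition files. It assumes the records are `tf.train.Example` messages with three bytes features, which is what the `feature_description` above implies. The helper `make_record`, the name `schema`, and all of the values are hypothetical placeholders, not real competition data.

```python
import tensorflow as tf

def make_record(recording_id: bytes, audio_wav: bytes, label_info: bytes) -> bytes:
    """Serialize one hypothetical record as a tf.train.Example (assumed layout)."""
    def _bytes(value):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
    example = tf.train.Example(features=tf.train.Features(feature={
        'recording_id': _bytes(recording_id),
        'audio_wav': _bytes(audio_wav),
        'label_info': _bytes(label_info),
    }))
    return example.SerializeToString()

# Placeholder values: the audio bytes are not a real WAV payload
serialized = make_record(b'demo_id', b'not-real-audio', b'1,1,0.5,100.0,1.5,2000.0,1')

# Same schema shape as feature_description: three scalar string features
schema = {
    'recording_id': tf.io.FixedLenFeature([], tf.string),
    'audio_wav': tf.io.FixedLenFeature([], tf.string),
    'label_info': tf.io.FixedLenFeature([], tf.string),
}

# Parsing turns the single serialized string into a dict of scalar string tensors
parsed = tf.io.parse_single_example(tf.constant(serialized), schema)
demo_recording_id = parsed['recording_id'].numpy()
demo_label_info = parsed['label_info'].numpy()
print(demo_recording_id, demo_label_info)
```

The parsed values are still byte strings at this point, which is why the later sections decode the audio and split the label string.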

# Limitations of the Parsed Dataset

In the above `parsed_dataset` each record represents an audio recording, but as per the data specification, a recording can contain more than one training example. The following code snippet lists the number of training examples for the first 10 audio recordings:

In [None]:
def examples_per_record(example_proto): 
    label = example_proto['label_info']
    labels = tf.strings.split(label,sep=";")
    return (example_proto['recording_id'],len(labels))

example_counts = parsed_dataset.map(examples_per_record)

print("(Recording ID, Number of Examples)")
audio_recordings_with_multiple_examples = []
for ex in example_counts.take(10).as_numpy_iterator():
    print(ex)
    if ex[1] > 1:
        audio_recordings_with_multiple_examples.append(ex)

Note that some of the records displayed above contain multiple examples:

In [None]:
audio_recordings_with_multiple_examples
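
To make the packing of several examples into one `label_info` string concrete, here is a plain-Python sketch of the same split. It assumes what the notebook's own string handling assumes: examples are separated by `;`, fields by `,`, and each example may be wrapped in double quotes. The field order follows the label info keys listed in the mapping code later in this notebook. The function `parse_label_info` and the `demo` string are hypothetical illustrations, not values read from the dataset.

```python
FIELDS = ('species_id', 'songtype_id', 't_min', 'f_min', 't_max', 'f_max', 'is_tp')

def parse_label_info(label_info: str) -> list:
    """Split a label_info string into one dict per training example."""
    examples = []
    for chunk in label_info.split(';'):
        # Drop the optional surrounding quotes and whitespace
        chunk = chunk.replace('"', '').strip()
        if not chunk:
            continue
        values = [float(v) for v in chunk.split(',')]
        examples.append(dict(zip(FIELDS, values)))
    return examples

# Hypothetical label string holding two examples for one recording
demo = '"14,1,44.5,2531.25,45.1,5531.25,1";"23,1,15.0,7235.0,16.9,11283.0,0"'
parsed_examples = parse_label_info(demo)
for ex in parsed_examples:
    print(ex)
```

Every field comes out as a float here, just as `tf.strings.to_number` produces `tf.float32` values that are then cast where an integer or boolean is wanted.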

# Remapping to a More Intuitive Dataset Interface

We would like a dataset in which each record is a training example with the audio signal clipped to the range [t_min,t_max] as specified in the `label_info`.  Also it will be convenient for the audio signal to be represented as `tf.float32` rather than a byte string.  Finally, we'd like to be able to access the features annotated in the label info individually using keys.  

The following code performs another dataset mapping to extract and properly format these features. The `tf.data.Dataset` method `flat_map` is used to flatten the one-to-many mapping of `parsed_dataset` records that contain multiple examples.

In [None]:
# Map to a new dataset with ten features: 
#    recording_id (tf.string)
#    species_id (tf.int32)
#    songtype_id(tf.int32)
#    t_min(tf.float32)
#    f_min(tf.float32)
#    t_max(tf.float32)
#    f_max(tf.float32)
#    is_tp(tf.bool)
#    sample_rate(tf.int32)
#    signal(tf.float32)

def decode_audio(audio_binary):
    audio, sample_rate = tf.audio.decode_wav(audio_binary)
    return tf.squeeze(audio, axis=-1), sample_rate

def clip_signal_to_interval(signal,sr,tmin,tmax):
    sr = tf.cast(sr,tf.float32)
    return signal[tf.cast(sr*tmin,tf.int32):tf.cast(sr*tmax,tf.int32)]

def parse_label(example_proto): 
    recording_id = example_proto['recording_id']
    
    label = example_proto['label_info']
    labels = tf.strings.split(label,sep=";")
    labels=tf.strings.regex_replace(labels,'"','')
    labels=tf.strings.strip(labels)
    labels = tf.strings.split(labels,',')
    labels=tf.strings.to_number(labels)
    
    (signal,sample_rate) = decode_audio(example_proto['audio_wav'])
    # Create dataset from label_info
    # Label info keys:
    #     (species_id, songtype_id, t_min, f_min, t_max, f_max, is_tp)
    dataset = tf.data.Dataset.from_tensor_slices(labels)
    # Map to new dataset with recording_id and label_info keys
    dataset = dataset.map(lambda x: {'recording_id':recording_id, 
                                     'species_id':tf.cast(x[0],tf.int32),
                                     'songtype_id':tf.cast(x[1],tf.int32),
                                     't_min':x[2],
                                     'f_min':x[3],
                                     't_max':x[4],
                                     'f_max':x[5],
                                     'is_tp':tf.cast(x[6],tf.bool),
                                     'sample_rate':sample_rate,
                                     'signal':clip_signal_to_interval(signal,sample_rate,x[2],x[4]),
                                    })
    
    return dataset

dataset = parsed_dataset.flat_map(parse_label)
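
The one-to-many behaviour of `flat_map` is easier to see on a toy dataset. In the sketch below, each record declares how many labels it holds, and `flat_map` expands it into that many output elements. The names `records`, `expand`, and `n_labels` and the two recording IDs are hypothetical stand-ins for `parsed_dataset` and `parse_label`.

```python
import tensorflow as tf

# Toy stand-in for parsed_dataset: rec_a holds one label, rec_b holds three
records = tf.data.Dataset.from_tensor_slices({
    'recording_id': ['rec_a', 'rec_b'],
    'n_labels': tf.constant([1, 3], dtype=tf.int64),
})

def expand(record):
    """Return a dataset with one element per label in this record."""
    rid = record['recording_id']
    return tf.data.Dataset.range(record['n_labels']).map(lambda i: (rid, i))

# flat_map concatenates the per-record datasets into a single flat dataset
flat = records.flat_map(expand)
flat_rows = [(rid.decode(), int(i)) for rid, i in flat.as_numpy_iterator()]
print(flat_rows)
```

A plain `map` would instead yield one nested dataset per record, which is why `flat_map` is the natural choice when a record can hold several examples.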

Here's a look at the first 5 records in the new dataset:

In [None]:
for k, example in enumerate(dataset.take(5)):
    print('\n******', k, '*****\n', example)

We can also verify that the `parsed_dataset` records containing multiple examples were mapped to multiple records in `dataset`:

In [None]:
for recording_id, nexamples in audio_recordings_with_multiple_examples:
    print('***************{}**************'.format(recording_id))
    examples = dataset.filter(lambda x: x['recording_id']==recording_id).take(nexamples)
    for ex in examples:
        print('{},{},{},{}'.format(recording_id,ex['species_id'],ex['songtype_id'],ex['t_min']))

It is also much easier for us to plot the waveforms, since the records now contain uncompressed audio signals with `tf.float32` datatype and we can easily index into the signal and sampling rate.

In [None]:
# Plot the first 10 records
import matplotlib.pyplot as plt
import numpy as np

nrecs = 10

# One subplot per record, arranged in a 5x2 grid
fig, ax_arr = plt.subplots(nrecs // 2, 2, figsize=(24, 36))

def plot_example(ax, example):
    signal = example['signal'].numpy()
    sample_rate = example['sample_rate'].numpy()
    # Time axis in seconds, measured from the start of the clip
    t = np.arange(len(signal)) / sample_rate
    ax.plot(t, signal)
    ax.set_xlabel('time (s)')
    ax.set_title('recording_id={recid}, tstart={tstart:.2f}'.format(
        recid=example['recording_id'].numpy().decode(),
        tstart=example['t_min'].numpy()))

# Pair each axis with the next record; zip stops once the axes run out
for ax, example in zip(ax_arr.flat, dataset):
    plot_example(ax, example)

plt.show()
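
As a sanity check on the clipped signals, the number of samples in each clip should equal the sample rate times `t_max - t_min`, up to truncation of the two slice indices. The plain-Python sketch below mirrors the index arithmetic of `clip_signal_to_interval`. The helper `clip_bounds`, the 48 kHz sample rate, and the time interval are assumptions chosen for illustration, and the notebook's `tf.float32` arithmetic may differ from this by a sample on some inputs.

```python
def clip_bounds(sample_rate: int, t_min: float, t_max: float) -> tuple:
    """Return (start, stop) sample indices, truncating toward zero."""
    return int(sample_rate * t_min), int(sample_rate * t_max)

sample_rate = 48000                      # assumed sample rate in Hz
signal = list(range(sample_rate * 3))    # stand-in for a 3 second recording

start, stop = clip_bounds(sample_rate, 0.5, 1.25)
clip = signal[start:stop]

# The clip duration in seconds should match t_max - t_min
clip_duration = len(clip) / sample_rate
print(start, stop, len(clip), clip_duration)
```

The same ratio, `len(signal) / sample_rate`, is what sets the extent of the time axis in the plots above.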