# TensorFlow-Ready Notebook

A BirdCLEF22 attempt. I did not catch a data leak until the eleventh hour; hopefully this public notebook will help others with BirdCLEF23. While the notebook is ready to train as is, I highly recommend training on an external machine. I did not have much luck on Kaggle due to the limited compute resources.

## High-level overview

I trust the user has done their own data exploration. 

1. Create a pd.DataFrame of audio files split into 5 s chunks. The dataframe is shuffled, and the filename + time information is used to construct the spectrogram for the correct time window.
2. The dataframe is parsed into a tf.data.Dataset yielding STFT + label pairs. Several wrappers are needed to force tf.py_function to accept the dictionary input.
3. Augmentations are applied via Dataset.map.
4. Finish off by normalising + centering the bins in the spectrogram. This is done by taking a log, then shifting and scaling.
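
The chunking in step 1 can be sketched in plain Python. This is a minimal illustration rather than the notebook's implementation: chunk_positions is a hypothetical helper, and the constants copy the values set in the Globals cell. Each chunk is identified by its end time in seconds, and a trailing remainder is kept only if it is at least 3 s long.

```python
# Constants mirror the notebook's globals.
CHUNK_LENGTH = 5      # seconds per chunk
SR = 32000            # samples per second
MIN_REMAINDER = 3     # keep a trailing remainder only if it is at least this long (seconds)

def chunk_positions(n_samples):
    """Return (dur_pos, partial) pairs for a clip with n_samples samples.

    dur_pos is the END time of the chunk in seconds, so a full chunk spans
    [dur_pos - CHUNK_LENGTH, dur_pos).
    """
    full_chunks, remainder = divmod(n_samples, CHUNK_LENGTH * SR)
    positions = [(i * CHUNK_LENGTH, False) for i in range(1, full_chunks + 1)]
    if remainder >= MIN_REMAINDER * SR:
        positions.append(((full_chunks + 1) * CHUNK_LENGTH, True))
    return positions

# A hypothetical 13.5 s clip: two full chunks plus a 3.5 s remainder that is kept.
print(chunk_positions(int(13.5 * SR)))
# A hypothetical 11 s clip: the 1 s remainder is too short and is dropped.
print(chunk_positions(11 * SR))
```

Storing only the file path and the end time keeps the dataframe small; the audio itself is loaded later, when the dataset is iterated.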

## Features

This notebook contains a few features that are easily customisable and usable in data projects of a similar scope.

### Augmentations

Standard augmentations, applied to the spectrogram images. Most of these are given as tf.functions which can be executed in graph mode. Those that cannot need to be wrapped with tf.py_function.

* Time shift
* White noise
* Pink noise
* Mixing of sounds
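
As a rough illustration of the first and last items, here is a NumPy sketch of a time roll and of mixing two spectrograms together with their labels. It is not the notebook's TensorFlow code: the function names, the toy shapes and the NumPy random generator are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def time_roll(spec, rng):
    """Circularly shift a (time, freq) spectrogram along the time axis."""
    shift = int(rng.integers(0, spec.shape[0]))
    return np.roll(spec, shift, axis=0)

def mix(spec_a, label_a, spec_b, label_b, rng):
    """Blend two spectrograms and merge their multi-hot labels."""
    alpha = rng.uniform(0.3, 0.7)            # weight of the first input
    mixed = alpha * spec_a + (1.0 - alpha) * spec_b
    label = np.maximum(label_a, label_b)     # a class is present if it is in either input
    return mixed, label

# Toy data: 4 time frames x 3 frequency bins, 5 classes.
spec_a = np.arange(12, dtype=np.float32).reshape(4, 3)
spec_b = np.ones((4, 3), dtype=np.float32)
label_a = np.array([1, 0, 0, 0, 0], dtype=np.float32)
label_b = np.array([0, 0, 1, 0, 0], dtype=np.float32)

rolled = time_roll(spec_a, rng)
mixed, label = mix(rolled, label_a, spec_b, label_b, rng)
print(label)
```

The roll only reorders frames, so the total energy of the spectrogram is unchanged, while the mix produces a label with both classes switched on.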

### Custom loss function

Binary crossentropy (BCE) suffers when the number of objects present in the label vector is small relative to the length of the vector. While BCE + sigmoid activation is nominally OK in this case (2-3 birds in a ~20-element vector), we experimented with a weighted categorical crossentropy function + softmax activation, where $y_{true}$ is divided by the number of positive classes. https://arxiv.org/pdf/1805.00932.pdf reports that this may help the training process; however, you will need another method to determine the number of objects at the inference stage (e.g. take the top $N$ items from the prediction vector).
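
A minimal NumPy sketch of the idea, assuming multi-hot targets with at least one positive class per row. The weighted_cce function here mirrors the loss described above, and top_n is a hypothetical helper for the inference step; neither is the notebook's TensorFlow implementation.

```python
import numpy as np

def weighted_cce(y_true, y_pred, eps=1e-7):
    """Categorical crossentropy with each multi-hot target scaled to sum to 1."""
    k = y_true.sum(axis=-1, keepdims=True)     # number of positive classes per row
    target = y_true / k                        # each positive class gets weight 1/K
    return -(target * np.log(np.clip(y_pred, eps, 1.0))).sum(axis=-1)

def top_n(y_pred, n):
    """Indices of the n largest scores, highest first."""
    return np.argsort(y_pred)[::-1][:n]

y_true = np.array([[1.0, 0.0, 1.0, 0.0]])      # two classes present
y_pred = np.array([[0.45, 0.10, 0.35, 0.10]])  # a softmax output (sums to 1)

print(weighted_cce(y_true, y_pred))            # -0.5 * (ln 0.45 + ln 0.35)
print(top_n(y_pred[0], n=2))
```

Because softmax outputs compete for a total mass of 1, a fixed threshold does not translate well between clips with one bird and clips with three, which is why the number of labels has to come from elsewhere at inference time.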




## Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import librosa
import librosa.display
import IPython.display as ipd
import ast
import json
import glob
import os

import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import callbacks

print(tf.__version__)
print(np.__version__)
print(pd.__version__)

## Globals
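
The spectrogram shape follows directly from the STFT constants defined in the cell below. A small worked calculation, assuming the STFT is taken without end padding (the default for tf.signal.stft), so that only windows fitting entirely inside the chunk are counted:

```python
CHUNK_LENGTH = 5   # seconds
SR = 32000         # samples per second
N_FFT = 2048       # window length in samples
HOP_DIV = 4        # hop = N_FFT / HOP_DIV
N_MELS = 128

total_frames = CHUNK_LENGTH * SR                    # samples per chunk
hop = N_FFT // HOP_DIV                              # samples between successive windows
freq_bands = N_FFT // 2 + 1                         # non-negative frequency bins of a real FFT
n_fft_frames = (total_frames - N_FFT) // hop + 1    # windows that fit without padding

print(total_frames, hop, freq_bands, n_fft_frames)  # 160000 512 1025 309
print("mel input shape:", (n_fft_frames, N_MELS, 1))
```

So a 5 s chunk becomes a 309 x 1025 magnitude spectrogram, or 309 x 128 after the mel conversion.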

In [None]:
"""
all constants here:
- audio constants, chunking parameters, etc
- stft constants, n_fft, hop size, etc.
- tensorflow constants: learning rate, layer depth, etc.
"""

def get_scored_birds(file_name = '../input/birdclef-2022/scored_birds.json'):
    with open(file_name) as sbfile:
        scored_birds = json.load(sbfile)
    return scored_birds

PATH_DIR = "../input/birdclef-2022/train_audio"
SEED = 42
SCORED_BIRDS = np.asarray(get_scored_birds())

#SINGLE AUDIO PARAMS
CHUNK_LENGTH = 5
SR = 32000
MIN_REMAINDER = 3
SEC_DUR = 30 #If audio duration >= SEC_DUR, ignore secondary birds. Shorter audio files are more likely to contain the secondary bird inside the 5s chunk.

#STFT GLOBALS
TOTAL_FRAMES = int(CHUNK_LENGTH * SR)
N_FFT = 2048
HOP_DIV = 4
FREQ_BANDS = (N_FFT//2) + 1
N_FFT_FRAMES = ((TOTAL_FRAMES - N_FFT)//(N_FFT // HOP_DIV)) + 1
STFT_NORM = 1 / 1.5 / FREQ_BANDS #Used in noise augmentations. hop_div > 1 raises the total energy from an STFT. Given the params, N_FFT, HOP_DIV and WINDOW_SIZE = N_FFT, this is our energy norm from STFT -> signal

#DATASET GENERATION PARAMS
MAX_FN = 50 #max number of files per class

#NET TRAINING PARAMS

#INPUT_SIZE = (N_FFT_FRAMES, FREQ_BANDS, 1) #If spectrogram is used for training
N_MELS = 128
INPUT_SIZE = (N_FFT_FRAMES, N_MELS, 1) #If mel-spectrogram is used for training

## Bird-to-index encoder/decoder
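
The encoder in the cell below is a plain name-to-index dictionary, and the decoder is its inverse. A compact sketch of the same idea, using made-up species codes (aaa1, bbb2 and ccc3 are placeholders, not real entries of scored_birds.json):

```python
# Placeholder species codes for illustration only, followed by the background class.
birds = ["aaa1", "bbb2", "ccc3"] + ["other"]

# Name -> index, in list order; 'other' therefore takes the last index.
encode = {bird: idx for idx, bird in enumerate(birds)}
# Index -> name, the inverse mapping.
decode = {idx: bird for bird, idx in encode.items()}

print(encode)
print(decode[encode["other"]])
```

With the 21 scored birds of the competition, the same construction places 'other' at index 21.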

In [None]:
"""
Create encoder/decoder to convert between birds to indices.
Add a background class represented by 'other', or index 21
"""

SCORED_BIRDS = np.concatenate((SCORED_BIRDS, ['other']))
print(SCORED_BIRDS)

# map each scored bird (plus 'other') to an integer id
encode = {}
bird_id = 0
for bird in SCORED_BIRDS:
    encode[bird] = bird_id
    bird_id+=1

encode["other"] = 21
decode = {v: k for k, v in encode.items()}

## Metadata for train data

In [None]:
"""
We're going to create a df that expands the loaded bird audio files into their chunks.
We point our tensorflow datagenerator to load specific "chunks" instead of unbatching
This allows us to get the exact filenumber, AND for those that are interested in training on kaggle, allows us to get away with smaller
shuffles on our datagen.
"""
INPUT_DIR = '/kaggle/input/birdclef-2022'

taxonomy_data = pd.read_csv(f'{INPUT_DIR}/eBird_Taxonomy_v2021.csv')
train_data = pd.read_csv(f'{INPUT_DIR}/train_metadata.csv')

#Add the full path of each audio file and drop the columns we do not need
train_data['sound_file'] = INPUT_DIR + '/train_audio/' + train_data.filename
train_data = train_data.drop(['author','license','url','filename'],axis=1) #drop() returns a new frame, so assign it back

## Constructing a DataFrame as an input for a tf.data.Dataset generator
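
Two details of the cell below are easy to miss: secondary labels are padded to a fixed length with the sentinel 99, and that sentinel later contributes nothing to the label vector because it lies outside the label depth. A NumPy sketch of both steps, with hypothetical names (PAD, DEPTH, multi_hot) and toy indices:

```python
import numpy as np

PAD = 99      # sentinel for "no secondary bird" or an unscored bird
DEPTH = 22    # 21 scored birds + 'other'

# Encoded secondary labels for three toy clips; [99] stands for "none".
secondary = [np.array([3]), np.array([99]), np.array([7, 99, 1])]

# Pad every row to the longest one so the column has a regular shape.
max_len = max(len(s) for s in secondary)
padded = np.stack([np.pad(s, (0, max_len - len(s)), constant_values=PAD) for s in secondary])
print(padded)

def multi_hot(primary, secondary_row, depth=DEPTH):
    """Multi-hot label from a primary index and a padded row of secondary indices."""
    label = np.zeros(depth, dtype=np.float32)
    idx = np.concatenate(([primary], secondary_row))
    idx = idx[idx < depth]     # the sentinel is out of range and is dropped
    label[idx] = 1.0
    return label

print(multi_hot(5, padded[2]).nonzero()[0])
```

In the notebook the same effect comes from tf.one_hot, which yields an all-zero row for an index outside the depth.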

In [None]:
"""
Construct a dataframe consisting of our birds, filenames and chunks. We shuffle this dataset and pull a smaller buffer during net training
to avoid kaggle from freaking out on us with memory issues

MAX_FN is max number of files per bird to pull. Partially mitigates the impact of imbalanced classes.
Alternatively you can pull all the birds in the desired class and adjust class weights in the loss function.
"""

index_frame = pd.DataFrame(columns = ['birdname', 'secondarybirds', 'filepath', 'duration', 'dur_pos', 'partial'])

def second_check(x):
    """
    For birds without any elements in the 'secondary_labels' column. Pass it a placeholder value
    """
    #np.unicode_ was removed in NumPy 2.0; np.str_ is the same type
    if len(x) == 0:
        return np.asarray(['placeholder']).astype(np.str_)
    return np.asarray(x).astype(np.str_)

def encode_second(x):
    """
    Encodes the names of birds into their respective indices. Unscored birds and the placeholder map to 99.
    If I had more time, I would change the final index from 99 to encode['other'] such that it is captured in my final labels
    """
    return np.asarray([*map(lambda i: encode.get(i, 99), x)])

def chunk_rows(element, birdname):
    """
    Load one audio file and return one row (dict) per 5s chunk.
    A final "partial" row is added if the leftover audio is at least MIN_REMAINDER seconds long.
    """
    y, _ = librosa.load(element.sound_file, sr=None)
    base = {"birdname":birdname, "secondarybirds": element.secondary_labels, "filepath":element.sound_file, "duration": len(y)/SR}

    chunk, remainder = np.divmod(len(y), TOTAL_FRAMES)
    rows = [{**base, "dur_pos":i*CHUNK_LENGTH, "partial": False} for i in range(1, chunk+1)]

    if remainder >= MIN_REMAINDER * SR:
        rows.append({**base, "dur_pos":(chunk+1)*CHUNK_LENGTH, "partial": True})

    return rows

#DataFrame.append was removed in pandas 2.0, so collect the rows in a list and build the frame once
rows = []

for bird in SCORED_BIRDS:
    bird_df = train_data[train_data.primary_label == bird]
    n_files = len(bird_df)
    
    if n_files > MAX_FN:
        bird_df = bird_df.sample(MAX_FN, random_state=SEED)
        
    for _ , element in bird_df.iterrows():
        rows.extend(chunk_rows(element, bird))

#Include the 'other' bird category
other_df = train_data[~train_data['primary_label'].isin(SCORED_BIRDS)].sample(int(MAX_FN), random_state=SEED)

for _ , element in other_df.iterrows():
    rows.extend(chunk_rows(element, "other"))

index_frame = pd.DataFrame(rows, columns=['birdname', 'secondarybirds', 'filepath', 'duration', 'dur_pos', 'partial'])

# Convert all birdname strings into their encoded indices
index_frame['birdname'] = index_frame['birdname'].map(encode.get)        
index_frame['secondarybirds'] = index_frame['secondarybirds'].map(lambda x: encode_second(second_check(ast.literal_eval(x))))

# Pandas does not take ragged arrays well. Convert the secondary_bird array into a standard format of shape (max_secondary, )
max_secondary = index_frame['secondarybirds'].map(len).max()
index_frame['secondarybirds'] = index_frame['secondarybirds'].map(lambda x: np.pad(x, (0, max_secondary - len(x)), constant_values=99))

print(index_frame.head())

index_frame.to_pickle('training_df.pickle')

## Utility functions
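
The normalize_audio function in the cell below maps magnitudes to roughly [-1, 1] with a base-10 log, a shift of 4 and a scale of 6. A NumPy sketch of the same arithmetic on a few hand-picked magnitudes (the function name normalize is illustrative):

```python
import numpy as np

SHIFT, SCALE = 4.0, 6.0   # same constants as normalize_audio

def normalize(mag):
    """log10, then shift and scale; the 1e-10 floor keeps log10 finite at zero."""
    return (np.log10(mag + 1e-10) + SHIFT) / SCALE

mags = np.array([0.0, 1e-4, 1.0, 1e2])
print(normalize(mags))
```

A silent bin lands at -1, a value of 1e-4 at about 0, and a value of 100 at 1, so the shift centres the range and the scale bounds it.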

In [None]:
"""
Utility functions used in generating dataset and passing it through to the net
"""

def bird_one_hot(bird_elem, birds):
    """
    Birds is an array of named birds that we would like to score.
    To exclude secondarybirds, i.e. converting multi-hot multi-label -> one-hot single-label, you can always manually change SEC_DUR into 0.
    """    
    depth = len(birds)
    
    if tf.math.reduce_any(decode[bird_elem['birdname'].numpy()] == "other"):
        return tf.one_hot(int(depth - 1), depth=depth, dtype=tf.float32)
    
    if bird_elem['duration'] < SEC_DUR:
        concat_label = tf.concat([tf.expand_dims(bird_elem['birdname'], axis=0), tf.cast(bird_elem['secondarybirds'], dtype=tf.int32)], axis=0)
        return tf.reduce_max(tf.one_hot(concat_label, depth=depth, dtype=tf.float32), axis=0)
        
    else:
        return tf.one_hot(bird_elem['birdname'], depth=depth, dtype=tf.float32)
    
@tf.function
def normalize_audio(stft, *label):
    """
    Convert to db or just log it.
    Manually checked that output is more or less between ~[-1,1]
    """
    shift = 4
    scale = 6
    
    if label:
        return ((tf.math.log(stft + 1e-10)/tf.math.log(10.0)) + shift)/scale, label[0]
    
    else:
        return ((tf.math.log(stft + 1e-10)/tf.math.log(10.0)) + shift)/scale

@tf.function
def add_single_channel(stft, *label):
    """
    Utility map to add final channel for net input.
    """
    if label:
        return tf.expand_dims(stft, axis=-1), label[0]
    else:
        return tf.expand_dims(stft, axis=-1)

In [None]:
"""
Augmentation functions.
All of these take in spectrograms, but some can be used with mel-spectrograms.
You will need to calculate STFT_NORM yourself if you mess with N_FFT, HOP_DIV and WINDOW_SIZE.

Calculation is straightforward -> divide total energy from signal(either in time or freq domain)/total energy in STFT.
With this factor you can compute total energy in the STFT and scale back to find total energy of the underlying signal.
"""

@tf.function
def time_roll(inp, *label):
    """
    Roll image in time dimension.
    """
    random_shift = tf.random.uniform(shape=(), minval=0, maxval=N_FFT_FRAMES, dtype=tf.int32)
    roll = tf.roll(inp, shift=random_shift, axis=0)
    
    if label:
        return roll, label[0]
    
    return roll

@tf.function
def white_noise(inp, *label):
    """
    Create a WN signal with up to 1% of the total signal energy (the gain is drawn uniformly from 0-1%).
    """
    
    energy_max = 1./100
    
    signal_energy = STFT_NORM * tf.reduce_sum(inp ** 2)
    gain = tf.random.uniform([]) * energy_max
    
    #power of white noise is equal to its variance, remember to divide by N samples to conserve energy after FFT
    std = tf.math.sqrt(signal_energy * gain) / (CHUNK_LENGTH * SR)
    
    noise_signal = tf.random.normal([CHUNK_LENGTH * SR], 0, std)
    noise_stft = tf.abs(tf.signal.stft(noise_signal, frame_length=N_FFT, frame_step=N_FFT//HOP_DIV))
    
    if label:
        return inp + noise_stft, label[0]
    
    return inp + noise_stft

@tf.function
def pink_noise(inp, *label):
    """
    Create a PN signal probably up to 100% due to wanting to get some energy in the middle bands.
    Ideal method is to calc total energy of the pink_noise spectrum, then scale upwards.
    Probably better to implement a user-defined min-freq cutoff. Here the 0Hz bin is replaced with 1./SR
    
    Here the factor of ~2000 comes from SR/2 divided by integral(1/f from ~0 to 16000), i.e. power loss from the 1/f scaling.
    """ 
    signal_energy = STFT_NORM * tf.reduce_sum(inp ** 2)
    gain = tf.random.uniform([])
    
    #power of white noise is equal to its variance
    std = tf.math.sqrt(signal_energy * gain) / (CHUNK_LENGTH * SR) * 2000
    
    inv_f = ((tf.cast(tf.linspace(0, SR//2, FREQ_BANDS), tf.float32))) + tf.concat((tf.constant([SR], dtype=tf.float32), tf.zeros(FREQ_BANDS-1)), axis=0)
    
    noise_signal = tf.random.normal([CHUNK_LENGTH * SR], 0, std)
    noise_stft = tf.abs(tf.signal.stft(noise_signal, frame_length=N_FFT, frame_step=N_FFT//HOP_DIV))/inv_f
    
    if label:
        return inp + noise_stft, label[0]
    
    return inp + noise_stft

@tf.function
def mix_items(inp, label, ds):
    """
    You need to define another dataset. Take care that this dataset contains the same elements as your training dataset
    (and nothing from the validation set) to prevent data leakage.
    
    The weight of our input in the mix is currently drawn from 0.3 - 0.7.
    
    Mix two spectrograms and return a multi-hot label.
    """
    mix_stft, mix_label = next(iter(ds))
    
    alpha = tf.random.uniform([])*0.4 + 0.3
    beta = 1 - alpha
    
    sum_stft = alpha * inp + beta * mix_stft
    sum_label = tf.reduce_max(tf.concat([label[None,:], mix_label[None,:]], axis=0), axis=0)
    
    return sum_stft, sum_label

@tf.function
def random_effect(inp, label, fn, base_prob):
    """
    tf graph mode does not like if-else statements. Wrapper to incorporate random augmentation with a base probability
    """
    tmp_tuple = tf.cond(tf.random.uniform([]) < base_prob, lambda: (fn(inp, label)), lambda: (inp, label))
    
    return tmp_tuple[0], tmp_tuple[1]

@tf.function
def random_mix_wrapper(inp, label, ds, base_prob):
    """
    Much easier to make a separate wrapper for the mixing instead of incorporating it into the random_effect function
    """
    tmp_tuple = tf.cond(tf.random.uniform([]) < base_prob, lambda: (mix_items(inp, label, ds)), lambda: (inp, label))
    
    return tmp_tuple[0], tmp_tuple[1]

def convert_mel(inp, label):
    """
    Take in a spectrogram, convert to mel-spectrogram. 
    """
    mel_conv = librosa.feature.melspectrogram(S=tf.transpose(inp)**2, sr=SR, n_mels=N_MELS)

    return mel_conv.T, label
    

## tf.data.Dataset generator and preprocessing
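
The construct_dataset function in the cell below splits by audio file rather than by chunk, so that chunks from one recording never land in both the training and validation sets. A minimal NumPy sketch of that group-wise split; split_by_file and the file names are hypothetical:

```python
import numpy as np

def split_by_file(filepaths, fraction=0.2, seed=42):
    """Return boolean (train_mask, val_mask) with whole files assigned to one side."""
    ids = np.unique(filepaths)                 # one entry per audio file
    rng = np.random.default_rng(seed)
    rng.shuffle(ids)                           # shuffle files, not chunks
    n_val = int(len(ids) * fraction)
    val_ids = set(ids[:n_val])
    is_val = np.array([f in val_ids for f in filepaths])
    return ~is_val, is_val

# Toy index: each entry is one chunk, labelled by the file it came from.
filepaths = np.array(["a.ogg"] * 3 + ["b.ogg"] * 2 + ["c.ogg"] * 4 + ["d.ogg"] + ["e.ogg"] * 2)
train_mask, val_mask = split_by_file(filepaths)

print("train files:", sorted(set(filepaths[train_mask])))
print("val files:  ", sorted(set(filepaths[val_mask])))
```

Splitting after chunking, chunk by chunk, would place neighbouring 5 s windows of the same recording on both sides of the split, which is the kind of leak this ordering avoids.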

In [None]:
"""
Bunch of tricks to force tf.Dataset generators to take in a dictionary.
"""
def dict_py_function(func, inp, Tout):
    """
    Trick py_function to take in a dict by passing our input off as an array, then reconstructing it inside a wrapped function
    Completely dumb.
    """
    def wrapped_func(*flattened_inp):
        #To reconstruct, pass tf.nest.pack_sequence_as(dict, flattened_dict_of_values, expand_composites=True)
        reconstructed_inp = tf.nest.pack_sequence_as(inp, flattened_inp, expand_composites=True)
        return func(*reconstructed_inp)
        
    return tf.py_function(func=wrapped_func, inp=tf.nest.flatten(inp, expand_composites=True), Tout=Tout)

def joint_parser(bird_elem, birds):
    """
    Load audio['filepath']
    Construct labels into one/multi-hot format    
    """    
    y, _ = librosa.load(bird_elem['filepath'].numpy(), sr=None)
    
    start_idx = (bird_elem['dur_pos'] - CHUNK_LENGTH) * SR
    end_idx = bird_elem['dur_pos'] * SR
    
    if bird_elem['partial']:
        if bird_elem['duration'] < CHUNK_LENGTH:
            y = tf.concat([y, y[:(end_idx - len(y))]], axis=0)
        else:
            #just overlap up to 2 seconds. easy.
            y = y[-(CHUNK_LENGTH * SR):]
    else:
        y = y[start_idx:end_idx]

    stft_fn = lambda x: tf.abs(tf.signal.stft(x, frame_length=N_FFT, frame_step=N_FFT//HOP_DIV))
    label_arr = bird_one_hot(bird_elem, SCORED_BIRDS)
    
    return stft_fn(y), label_arr

def construct_dataset(bird_set, birds=SCORED_BIRDS, fraction=0.20, seed=SEED, second_shuffle=False):
    """
    ---
    CAUTION: UPDATED ON 24/05. PREVIOUS IMPLEMENTATION INCURRED DATA LEAK.
    CORRECT ORDER IS: SHUFFLE -> SAMPLE -> CHUNK INTO 5S
    
    As our dictionary is already chunked, we're going to shuffle by group (filepath) and pray that birds with small entries do not end up
    in the validation. In future, you can probably code something to ensure that birdsounds < some number is forced in the training set.
    ---
    New function uses from_tensor_slices instead of list_files -> can shuffle the entire list quickly and use a smaller buffer for the actual training
    
    WARNING to_dict() changes np,pd objects to python scalars which is subsequently re-interpreted by tensorflow
    You need to EXPLICITLY CAST dtype INSIDE THE MAPPING FUNCTIONS.
    """
    
    #number of unique filepaths, or audio files
    ids = bird_set.filepath.unique()
    rng = np.random.default_rng(seed=seed) #use the seed argument rather than the global SEED
    rng.shuffle(ids)
    
    split = int(len(ids) * fraction)
    
    tmp_train_birdset = bird_set.set_index("filepath").loc[ids[split:]].reset_index().set_index('birdname').reset_index()
    tmp_val_birdset = bird_set.set_index("filepath").loc[ids[:split]].reset_index().set_index('birdname').reset_index()
    
    train_dataset  = tf.data.Dataset.from_tensor_slices(tmp_train_birdset.to_dict('list'))
    val_dataset  = tf.data.Dataset.from_tensor_slices(tmp_val_birdset.to_dict('list'))
    
    train_len = train_dataset.cardinality().numpy()
    val_len = val_dataset.cardinality().numpy()
    
    train_dataset = train_dataset.shuffle(train_len)
    val_dataset = val_dataset.shuffle(val_len)
    
    if second_shuffle:
        train_dataset = train_dataset.shuffle(train_len)
    
    #need to map a dict deconstructor as py_function does not play well with dicts
    #https://github.com/tensorflow/tensorflow/issues/27679

    train_dataset = train_dataset.map(lambda a: dict_py_function(joint_parser, [a, birds], [tf.float32, tf.float32]), num_parallel_calls=tf.data.experimental.AUTOTUNE)
    val_dataset = val_dataset.map(lambda a: dict_py_function(joint_parser, [a, birds], [tf.float32, tf.float32]), num_parallel_calls=tf.data.experimental.AUTOTUNE)
    
    print(f"-- dataset sizes -- train:{train_len}, val:{val_len} --")

    return train_dataset, val_dataset, train_len, val_len

## ResNet models

Includes Resnet34, Resnet34v2 and Resnet50v2
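
As a rough guide to how much the network downsamples a spectrogram, the sketch below tracks the spatial size of the feature map through res_net50 for the mel input of shape (309, 128, 1). It assumes Keras 'same' padding, where a layer with stride s maps a length n to ceil(n / s); the helper names are illustrative.

```python
import math

def same_stride(n, stride):
    """Output length of a 'same'-padded layer with the given stride."""
    return math.ceil(n / stride)

def feature_map_sizes(height, width, pad=2):
    """List (stage, height, width) through the stem and the four ResNet stages."""
    h, w = height + 2 * pad, width + 2 * pad     # ZeroPadding2D((2,2))
    sizes = [("zero padding", h, w)]
    stages = [
        ("7x7 conv, stride 2", 2),
        ("3x3 max pool, stride 2", 2),
        ("stage 1 (no downsampling)", 1),
        ("stage 2", 2),
        ("stage 3", 2),
        ("stage 4", 2),
    ]
    for name, stride in stages:
        h, w = same_stride(h, stride), same_stride(w, stride)
        sizes.append((name, h, w))
    return sizes

for name, h, w in feature_map_sizes(309, 128):
    print(f"{name:28s} {h:4d} x {w:3d}")
```

The final 10 x 5 map is then collapsed by GlobalAveragePooling2D into a single vector per example before the dense output layer.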

In [None]:
"""
resnet34 architecture. adapted + thanks to:
    https://www.analyticsvidhya.com/blog/2021/08/how-to-code-your-resnet-from-scratch-in-tensorflow/
    
tested on MNIST data first. no problem. resnet has a minimum size, so some padding is needed for MNIST data.

you need to pre-define the input size of the spectrogram. can be computed in the globals: (N_FFT_FRAMES, FREQ_BANDS, 1)
"""
def res_block(input_tensor, n_filters, kernel_size, strides=(1,1), activation='relu', padding='same', kernel_initializer="he_normal"):
    x_skip = input_tensor
    x = layers.Conv2D(n_filters, kernel_size, strides=strides, activation=activation, padding=padding, kernel_initializer=kernel_initializer)(input_tensor)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(n_filters, kernel_size, strides=strides, activation=None, padding=padding, kernel_initializer=kernel_initializer)(x)
    x = layers.BatchNormalization()(x)
    
    x = layers.Add()([x, x_skip])
    x = layers.Activation('relu')(x)

    return x

def conv_res_block(input_tensor, n_filters, kernel_size, strides=(2,2), activation='relu', padding='same', kernel_initializer="he_normal"):
    #essentially downsamples by striding instead of pooling
    x_skip = input_tensor
    x_skip = layers.Conv2D(n_filters, (1,1), strides=strides, activation=None, padding=padding, kernel_initializer=kernel_initializer)(x_skip)
    
    x = layers.Conv2D(n_filters, kernel_size, strides=strides, activation=activation, padding=padding, kernel_initializer=kernel_initializer)(input_tensor)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(n_filters, kernel_size, strides=(1,1), activation=None, padding=padding, kernel_initializer=kernel_initializer)(x)
    x = layers.BatchNormalization()(x)
    
    x = layers.Add()([x, x_skip])
    x = layers.Activation('relu')(x)
    
    return x
    
def res_net34(input_size=(28,28,1), n_output=10, n_base=64, final_act='softmax'):
    #mnist image is 28, resnet takes in a minimum of 32,32, so let's pad it by (2,2)
    img_input = layers.Input(input_size)
    #x = layers.ZeroPadding2D((2,2))(img_input)
    
    #from resnet, 34 layers, take a 7x7 conv with n_base, stride 2
    x = layers.Conv2D(n_base, kernel_size=(7,7), strides=(2,2), activation='relu', padding='same')(img_input)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPool2D(pool_size=(3,3), strides=(2,2), padding='same')(x)
    
    block_layers = [3, 4, 6, 3]
    n_filter = n_base
    
    for i in range(4):
        if i == 0:
            for j in range(block_layers[i]):
                x = res_block(x, n_filter, kernel_size=(3,3))
        else:
            n_filter = n_filter * 2
            x = conv_res_block(x, n_filter, kernel_size=(3,3))
            for j in range(block_layers[i] - 1):
                x = res_block(x, n_filter, kernel_size=(3,3))
    
    #finish with an avgpool, FC and softmax
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(n_output, activation=final_act)(x)
    
    model = tf.keras.Model(inputs=[img_input], outputs=[out])
    return model

In [None]:
"""
resnet34v2 architecture

tested on MNIST. Unused but here to test the v2 implementation of the resblocks. (reordering activation and concatenation operations)

key difference between v1 and v2 is BN - Relu - Conv2d instead of Conv - Relu - BN
"""
def res_blockv2(input_tensor, n_filters, kernel_size, strides=(1,1), activation='relu', padding='same', kernel_initializer="he_normal"):
    x_skip = input_tensor
    
    x = layers.BatchNormalization()(input_tensor)
    
    x = layers.Activation(activation)(x)
    x = layers.Conv2D(n_filters, kernel_size, strides=strides, activation=None, padding=padding, kernel_initializer=kernel_initializer)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation(activation)(x)
    x = layers.Conv2D(n_filters, kernel_size, strides=strides, activation=None, padding=padding, kernel_initializer=kernel_initializer)(x)
    
    tf.debugging.check_numerics(x, "x is producing nans!")
    tf.debugging.check_numerics(x_skip, "x_skip is producing nans!")
    
    x = layers.Add()([x, x_skip])

    return x

def conv_res_blockv2(input_tensor, n_filters, kernel_size, strides=(2,2), activation='relu', padding='same', kernel_initializer="he_normal"):
    #essentially downsamples by striding instead of pooling
    x_skip = input_tensor
    x_skip = layers.Conv2D(n_filters, (1,1), strides=strides, activation=None, padding=padding, kernel_initializer=kernel_initializer)(x_skip)
    
    x = layers.BatchNormalization()(input_tensor)
    x = layers.Activation(activation)(x)
    x = layers.Conv2D(n_filters, kernel_size, strides=strides, activation=None, padding=padding, kernel_initializer=kernel_initializer)(x)
    
    x = layers.BatchNormalization()(x)
    x = layers.Activation(activation)(x)
    x = layers.Conv2D(n_filters, kernel_size, strides=(1,1), activation=None, padding=padding, kernel_initializer=kernel_initializer)(x)
    
    tf.debugging.check_numerics(x, "conv x is producing nans!")
    tf.debugging.check_numerics(x_skip, "conv x_skip is producing nans!")
    
    x = layers.Add()([x, x_skip])
    
    return x
    
def res_net34v2(input_size=(28,28,1), n_output=10, n_base=64):
    #mnist image is 28, resnet takes in a minimum of 32,32, so let's pad it by (2,2)
    img_input = layers.Input(input_size)
    x = layers.ZeroPadding2D((2,2))(img_input)
    
    #from resnet, 34 layers, take a 7x7 conv with n_base, stride 2
    x = layers.Conv2D(n_base, kernel_size=(7,7), strides=(2,2), activation='relu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPool2D(pool_size=(3,3), strides=(2,2), padding='same')(x)
    
    block_layers = [3, 4, 6, 3]
    n_filter = n_base
    
    for i in range(4):
        if i == 0:
            for j in range(block_layers[i]):
                x = res_blockv2(x, n_filter, kernel_size=(3,3))
        else:
            n_filter = n_filter * 2
            x = conv_res_blockv2(x, n_filter, kernel_size=(3,3))
            for j in range(block_layers[i] - 1):
                x = res_blockv2(x, n_filter, kernel_size=(3,3))
    
    #finish with an avgpool, FC and softmax
    x = layers.GlobalAveragePooling2D()(x)
    tf.debugging.check_numerics(x, "final x is producing nans!")
    out = layers.Dense(n_output, activation='softmax')(x)
    
    model = tf.keras.Model(inputs=[img_input], outputs=[out])
    return model

In [None]:
"""
resnet50v2 architecture.
you need to pre-define the input size of the spectrogram. can be computed in the globals: (N_FFT_FRAMES, FREQ_BANDS, 1)

key difference between v1 and v2 is BN - Relu - Conv2d instead of Conv - Relu - BN

This is our preferred training model. Tested on MNIST data.

Funky interaction with AdamW optimizer (Adam + weight decay).
"""


def res50_block(input_tensor, n_filters, kernel_size, strides=(1,1), activation='relu', padding='same', kernel_initializer="he_normal"):
    """
    slightly different to res34 blocks. (1x1, n_filter) -> (3x3, n_filter) -> (1x1, n_filter * 4)
    """
    x_skip = input_tensor
    x_skip = layers.Conv2D(n_filters * 4, (1,1), strides=strides, activation=None, padding=padding, kernel_initializer=kernel_initializer)(x_skip)
    
    x = layers.BatchNormalization()(input_tensor)
    x = layers.Activation(activation)(x)
    x = layers.Conv2D(n_filters, (1,1), strides=strides, activation=None, padding=padding, kernel_initializer=kernel_initializer)(x)
    
    x = layers.BatchNormalization()(x)
    x = layers.Activation(activation)(x)
    x = layers.Conv2D(n_filters, kernel_size, strides=strides, activation=None, padding=padding, kernel_initializer=kernel_initializer)(x)
    
    x = layers.BatchNormalization()(x)
    x = layers.Activation(activation)(x)
    x = layers.Conv2D(n_filters * 4, (1,1), strides=strides, activation=None, padding=padding, kernel_initializer=kernel_initializer)(x)
    
    tf.debugging.check_numerics(x, "x is producing nans!")
    tf.debugging.check_numerics(x_skip, "x_skip is producing nans!")
    
    x = layers.Add()([x, x_skip])

    return x

def conv_res50_block(input_tensor, n_filters, kernel_size, strides=(2,2), activation='relu', padding='same', kernel_initializer="he_normal"):
    #essentially downsamples by striding instead of pooling
    x_skip = input_tensor
    x_skip = layers.Conv2D(n_filters * 4, (1,1), strides=strides, activation=None, padding=padding, kernel_initializer=kernel_initializer)(x_skip)
    
    x = layers.BatchNormalization()(input_tensor)
    x = layers.Activation(activation)(x)
    x = layers.Conv2D(n_filters, (1,1), strides=(1,1), activation=None, padding=padding, kernel_initializer=kernel_initializer)(x)
    
    x = layers.BatchNormalization()(x)
    x = layers.Activation(activation)(x)
    x = layers.Conv2D(n_filters, kernel_size, strides=strides, activation=None, padding=padding, kernel_initializer=kernel_initializer)(x)
    
    x = layers.BatchNormalization()(x)
    x = layers.Activation(activation)(x)
    x = layers.Conv2D(n_filters * 4, (1,1), strides=(1,1), activation=None, padding=padding, kernel_initializer=kernel_initializer)(x)
    
    tf.debugging.check_numerics(x, "conv x is producing nans!")
    tf.debugging.check_numerics(x_skip, "conv x_skip is producing nans!")
    
    x = layers.Add()([x, x_skip])
    
    return x
    
def res_net50(input_size=INPUT_SIZE, n_output=len(SCORED_BIRDS), n_base=64, final_act='softmax'):
    
    img_input = layers.Input(input_size)
    x = layers.ZeroPadding2D((2,2))(img_input)
    
    x = layers.Conv2D(n_base, kernel_size=(7,7), strides=(2,2), activation='relu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPool2D(pool_size=(3,3), strides=(2,2), padding='same')(x)
    
    block_layers = [3, 4, 6, 3]
    n_filter = n_base
    
    for i in range(4):
        if i == 0:
            for j in range(block_layers[i]):
                x = res50_block(x, n_filter, kernel_size=(3,3))
        else:
            n_filter = n_filter * 2
            x = conv_res50_block(x, n_filter, kernel_size=(3,3))
            for j in range(block_layers[i] - 1):
                x = res50_block(x, n_filter, kernel_size=(3,3))
    
    #finish with an avgpool, FC and softmax
    
    x = layers.Activation('relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(n_output, activation=final_act)(x)
    
    model = tf.keras.Model(inputs=[img_input], outputs=[out])
    return model

## Running and training the model

In [None]:
"""
If you've generated the dataframe earlier, you can load it here.
"""
# index_frame = pd.read_pickle("./training_df.pickle")
# print(index_frame.head())

In [None]:
"""
Generating dataset + mixing dataset. Ensure that the mixing dataset is shuffled a SECOND time via the second_shuffle flag so that
an audio file is not mixed with itself.

Cache (if possible) and repeat.
"""

train_ds, val_ds, train_len, val_len = construct_dataset(index_frame, birds=SCORED_BIRDS, fraction=0.20, seed=SEED)

mix_ds, _, _ , _ = construct_dataset(index_frame, birds=SCORED_BIRDS, fraction=0.20, seed=SEED, second_shuffle=True)
mix_ds = mix_ds.cache().repeat()


In [None]:
"""
Net parameters

Included is the compile for BCE + Sigmoid if you wish to use that instead.
"""
def weighted_cce(y_true, y_pred):
    """
    Loss function: categorical crossentropy, with the multi-hot target weighted by 1/K (K = number of positive labels).
    Multi-label training with softmax + CCE. Tests by Facebook (see https://arxiv.org/pdf/1805.00932.pdf) show that this offers a
    moderate improvement to training, at the expense of inference ability (you don't know the number of labels when inferring,
    so you need to guess or have another model).
    """
    y_scale = (1/tf.math.reduce_sum(y_true, axis=-1))[:,tf.newaxis] #shape (batch, 1)
    y_new = tf.math.multiply(y_true, y_scale)
    loss = tf.keras.losses.categorical_crossentropy(y_new, y_pred)
    
    return loss

EPOCHS = 100
BATCH_SIZE = 32 #16
STEPS_PER_EPOCH = train_len // BATCH_SIZE

model = res_net50(input_size=INPUT_SIZE, n_output=len(SCORED_BIRDS), final_act='softmax')

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, epsilon=1e-1)

#model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['binary_accuracy', tfa.metrics.FBetaScore(num_classes=len(SCORED_BIRDS), average='macro')])
model.compile(loss=weighted_cce, optimizer=optimizer, metrics=[tfa.metrics.FBetaScore(num_classes=len(SCORED_BIRDS), average='macro')])
model.summary()

In [None]:
"""
Callbacks for training
"""
earlystopping = callbacks.EarlyStopping(monitor='val_loss', verbose=1, min_delta=1e-4, patience=5)

plateau = callbacks.ReduceLROnPlateau(factor=0.5, patience=3, min_lr=1e-6, verbose=1)

checkpoint = callbacks.ModelCheckpoint(f'./checkpoint.hdf5', monitor='val_loss', verbose=1, save_best_only=True)

csv_logger = callbacks.CSVLogger(f'./log.out', separator=',')

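#NOTE: this rebinds the name callbacks and shadows the imported tensorflow.keras.callbacks module. Re-run the imports cell before re-running this cell.
callbacks = [earlystopping, plateau, checkpoint, csv_logger]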


In [None]:
"""
Augmenting dataset
Remember to normalize + add single channel at the end.
From experience on Kaggle, TRAIN_SHUFFLE and VAL_SHUFFLE must be kept low there; on other machines set them to train_len and val_len respectively

ONLY CACHE if you are able to fill your shuffle buffer with all the elements. otherwise remove it or live with the warnings
"""
TRAIN_SHUFFLE = 1 #train_len
VAL_SHUFFLE = 1 #val_len

train_dataset = train_ds.cache().shuffle(TRAIN_SHUFFLE) \
                                .map(time_roll) \
                                .map(lambda x1, y1: random_effect(x1,y1,white_noise, 0.1)) \
                                .map(lambda x2, y2: random_effect(x2, y2, pink_noise, 0.3)) \
                                .map(lambda x3, y3: random_mix_wrapper(x3, y3, mix_ds, 0.4)) \
                                .map(lambda x4, y4: tf.py_function(convert_mel, [x4, y4], [tf.float32, tf.float32])) \
                                .map(normalize_audio).map(add_single_channel).batch(BATCH_SIZE).repeat()

val_dataset = val_ds.map(lambda x4, y4: tf.py_function(convert_mel, [x4, y4], [tf.float32, tf.float32])) \
                    .map(normalize_audio) \
                    .map(add_single_channel) \
                    .cache().shuffle(VAL_SHUFFLE).batch(BATCH_SIZE)

In [None]:
"""
Training the model. Tested and works, but as stated, you'll want to do this on an external computer or live with the thousand warnings.
"""
#os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' #Suppress warnings if you want

model_history = model.fit(train_dataset,
                          epochs=EPOCHS,
                          steps_per_epoch=STEPS_PER_EPOCH,
                          validation_data=val_dataset,
                          callbacks=callbacks)

In [None]:
"""
Save model as hdf5
"""
model.save(f'model.hdf5')

## Plot model training history

In [None]:
"""
Simple plots. You can query more via model_history.history['(metric)']
"""

loss = model_history.history['loss']
val_loss = model_history.history['val_loss']
epochs = range(1, len(loss) + 1) #early stopping may end training before EPOCHS

fig = plt.figure()
ax = fig.add_subplot(111)

ax.plot(epochs, loss, 'r', label='Training loss')
ax.plot(epochs, val_loss, 'bo', label='Validation loss')

ax.set_xlabel('Epoch')
ax.set_ylabel('Loss Value')

ax.set_ylim([0, max(loss[0], val_loss[0])]) #np.max would treat the second argument as an axis
ax.legend()

In [None]:
"""
Load a saved model if you have one.

Ensure you have compile=False unless you have the custom loss function defined elsewhere.
"""
#model = tf.keras.models.load_model('./model.hdf5', compile=False)

# Testing pipeline + Examples

Some examples to visualise outputs from the Dataset generator

In [None]:
"""
Take one element from val_ds (NOT val_dataset, which is the batched transformed images used in validation), apply transformations and
plot the result. Make sure you run the appropriate section/s that defines the augmentation and the dataset.
"""
for e in val_ds.take(1).map(lambda x1, y1: random_effect(x1, y1, white_noise, 0.5)) \
                       .map(lambda x2,y2: random_effect(x2, y2, pink_noise, 0.5)) \
                       .map(lambda x3, y3: random_mix_wrapper(x3, y3, mix_ds, 0.5)) \
                       .map(lambda x4, y4: tf.py_function(convert_mel, [x4, y4], [tf.float32,tf.float32])) \
                       .map(normalize_audio):
    print(f'Label vector: {e[1]}')
    
    fig, ax = plt.subplots()
    img = librosa.display.specshow(e[0].numpy().T, sr=SR, y_axis='mel', x_axis='time', ax=ax, vmin=-1, vmax=1, cmap='coolwarm')
    ax.set_title('Mel spectrogram')
    fig.colorbar(img, ax=ax)
    
for e in val_ds.take(1).map(lambda x4, y4: tf.py_function(convert_mel, [x4, y4], [tf.float32,tf.float32])) \
                       .map(normalize_audio):
    print(f'Label vector: {e[1]}')
    
    fig, ax = plt.subplots()
    img = librosa.display.specshow(e[0].numpy().T, sr=SR, y_axis='mel', x_axis='time', ax=ax, vmin=-1, vmax=1, cmap='coolwarm')
    ax.set_title('Mel spectrogram')
    fig.colorbar(img, ax=ax)