<a href="https://colab.research.google.com/github/suhaaskarthik/birdsong-classification/blob/main/audio_classification_tl-melspec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

birdclef_2024_path = kagglehub.competition_download('birdclef-2024')
suhaaskarthikeyan_audio_files_path = kagglehub.dataset_download('suhaaskarthikeyan/audio-files')
suhaaskarthikeyan_best_model_path = kagglehub.dataset_download('suhaaskarthikeyan/best-model')
suhaaskarthikeyan_best_weight_36_path = kagglehub.dataset_download('suhaaskarthikeyan/best-weight-36')

print('Data source import complete.')


# Audio classification
This notebook covers audio classification with mel spectrograms: the audio is converted to images, which are then classified using transfer learning followed by fine-tuning. It also covers learning rate schedulers, dataset preparation, preprocessing, batching, and shuffling.


First we load the audio at the default sample rate and convert it from an amplitude-time signal to a mel spectrogram (frequency-time), since frequency-based features tell us more about the audio. The parameters used for the conversion are widely used for this dataset and tend to yield better results. We then convert the spectrogram to a logarithmic (dB) scale, because loudness is perceived logarithmically.

We convert the result to an RGB-like 3-channel array so that it is compatible with the model, and resize it to ensure uniformity. The preprocess function uses the tf.py_function wrapper so that the TensorFlow tensors passed in during training (under graph execution) can be handled as NumPy-accessible quantities.

In [None]:
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf  # used by preprocess() below
from PIL import Image
import cv2

def audio_to_melspectrogram_image(audio_path, sr=22050, n_fft=2048, hop_length=512, n_mels=128, f_min=20, f_max=16000, duration=5, img_size=256):
    # Inside tf.py_function the path arrives as a string tensor
    audio_path = audio_path.numpy().decode("utf-8")
    # Load the first 'duration' seconds of audio
    y, sr = librosa.load(audio_path, sr=sr, duration=duration)

    # Compute mel spectrogram
    # (f_min and f_max are currently unused; the filters span 0 Hz to Nyquist)
    mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
    n_mels=n_mels, fmax=sr // 2)

    # Convert to log scale (dB)
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
    # Min-max scale to [0, 255]; the epsilon avoids division by zero on constant input
    mel_spec_norm = 255 * (mel_spec_db - mel_spec_db.min()) / (mel_spec_db.max() - mel_spec_db.min() + 1e-8)
    mel_spec_norm = mel_spec_norm.astype(np.float32)
    mel_image = Image.fromarray(mel_spec_norm)
    mel_image = mel_image.resize((img_size, img_size), Image.LANCZOS)

    # Convert to 3-channel image
    mel_image = np.stack([np.asarray(mel_image)] * 3, axis=-1)
    return mel_image

def preprocess(file_path):
    features = tf.py_function(
            func=audio_to_melspectrogram_image,
            inp=[file_path],
            Tout=tf.float32
        )

    return features
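
The scaling and channel-stacking steps of audio_to_melspectrogram_image can be checked in isolation, without any audio files. The following is a minimal sketch on a synthetic array; `fake_db`, `scaled`, and `rgb_like` are illustrative names that are not part of the notebook, and the 128 x 216 shape is an assumption matching roughly 5 seconds at 22050 Hz with a hop length of 512.

```python
import numpy as np

# Synthetic stand-in for a dB-scaled mel spectrogram: 128 mel bands x 216 frames,
# with values in the [-80, 0] dB range (assumed, not real audio).
rng = np.random.default_rng(0)
fake_db = rng.uniform(-80.0, 0.0, size=(128, 216)).astype(np.float32)

# Min-max scale to [0, 255]; the small epsilon guards against a constant input.
span = fake_db.max() - fake_db.min()
scaled = 255.0 * (fake_db - fake_db.min()) / (span + 1e-8)

# Repeat the single channel three times to get an RGB-like (H, W, 3) array.
rgb_like = np.stack([scaled] * 3, axis=-1)
print(rgb_like.shape, float(rgb_like.min()), float(rgb_like.max()))
```

All three channels carry the same values, so the stacking adds no information; it only gives the array the shape an image model expects.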


audio.csv lists the recordings that last more than 5 seconds, which tend to be of better quality. According to various discussion forums, it is better to clip the recordings to a fixed length, which ensures a uniform duration while still capturing the most important part of the data.

In [None]:
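import os
import pandas as pd

df = pd.read_csv('/kaggle/input/audio-files/audio.csv')
# Class names are the sub-directory names; a label is the class's position in this list.
# Note: os.listdir returns entries in arbitrary order, so the mapping depends on the file system.
bird_classes = os.listdir('/kaggle/input/birdclef-2024/train_audio')
labels = []
files = []
for i in df['audio']:
    files.append(i)
    bc = i.split('/')[-2]  # parent directory name is the bird class
    labels.append(bird_classes.index(bc))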
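
The label derivation above can be illustrated on a few made-up paths. This is only a sketch: the paths and class names are hypothetical, and sorting the class names is an assumption made here to get a reproducible order (the notebook itself uses the order returned by os.listdir). A dictionary lookup replaces the repeated list.index call.

```python
# Hypothetical paths that follow the .../<class>/<file> layout used in audio.csv.
toy_paths = [
    "train_audio/species_b/XC1.ogg",
    "train_audio/species_a/XC2.ogg",
    "train_audio/species_b/XC3.ogg",
]

# Sorting gives a reproducible class order (an assumption of this sketch).
toy_classes = sorted({p.split("/")[-2] for p in toy_paths})
class_to_index = {name: idx for idx, name in enumerate(toy_classes)}

# The parent directory of each file determines its integer label.
toy_path_labels = [class_to_index[p.split("/")[-2]] for p in toy_paths]
print(toy_classes)
print(toy_path_labels)
```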

Splitting the data into training and testing sets. The files and labels are shuffled into the same order by setting the same seed before each shuffle.

In [None]:
import random
random.seed(123)
random.shuffle(files)
random.seed(123)
random.shuffle(labels)
train_sample = int(len(files)*0.9)
training_files = files[:train_sample]
training_labels = labels[:train_sample]

testing_files = files[train_sample:]
testing_labels = labels[train_sample:]
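
Seeding twice works because both lists have the same length, so the same permutation is applied to each. An alternative, sketched here on made-up data (`toy_files` and `toy_labels` are illustrative stand-ins, not the notebook's lists), is to shuffle the (file, label) pairs together so that the alignment holds by construction.

```python
import random

# Hypothetical stand-ins for the notebook's `files` and `labels` lists.
toy_files = [f"class_{i % 3}/clip_{i}.ogg" for i in range(10)]
toy_labels = [i % 3 for i in range(10)]

# Shuffle (file, label) pairs as a unit so they can never drift apart.
pairs = list(zip(toy_files, toy_labels))
random.Random(123).shuffle(pairs)
shuffled_files, shuffled_labels = (list(t) for t in zip(*pairs))

# 90/10 split, as in the cell above.
split = int(len(shuffled_files) * 0.9)
train_files, test_files = shuffled_files[:split], shuffled_files[split:]
train_labels, test_labels = shuffled_labels[:split], shuffled_labels[split:]
print(len(train_files), len(test_files))
```

This does not necessarily reproduce the exact ordering of the two-seed approach; it only guarantees that each file stays paired with its own label.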

We use TensorFlow's graph-based execution to map the training files through the preprocessing function, which is wrapped in py_function for compatibility with NumPy. We create separate datasets for the audio and the labels, zip them together, and then batch them. The same is done for the test data.

In [None]:
import tensorflow as tf
training_dataset_files = tf.data.Dataset.from_tensor_slices(training_files)
training_dataset_files = training_dataset_files.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
training_dataset_labels = tf.data.Dataset.from_tensor_slices(training_labels)
training_data = tf.data.Dataset.zip((training_dataset_files, training_dataset_labels))
training_data = training_data.map(lambda audio, label: (tf.ensure_shape(audio, (256,256,3)),
                                          tf.ensure_shape(label, ())))

training_data = training_data.batch(64)

testing_dataset_files = tf.data.Dataset.from_tensor_slices(testing_files)
testing_dataset_files = testing_dataset_files.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
testing_dataset_labels = tf.data.Dataset.from_tensor_slices(testing_labels)
testing_data = tf.data.Dataset.zip((testing_dataset_files, testing_dataset_labels))
testing_data = testing_data.map(lambda audio, label: (tf.ensure_shape(audio, (256,256,3)),
                                          tf.ensure_shape(label, ())))

testing_data = testing_data.batch(64)




This class implements a cosine annealing learning rate scheduler with warmup: the learning rate increases linearly for a set number of warmup epochs and then follows a cosine decay.

In [None]:
import tensorflow as tf
import numpy as np

class CosineAnnealingWithWarmup(tf.keras.callbacks.Callback):
    def __init__(self, total_epochs, warmup_epochs=5, peak_lr=1e-4):
        super().__init__()
        self.total_epochs = total_epochs
        self.warmup_epochs = warmup_epochs
        self.peak_lr = peak_lr
        self.current_epoch = 0

    def on_epoch_begin(self, epoch, logs=None):
        self.current_epoch = epoch + 1  # Keras epoch starts from 0
        new_lr = self.compute_lr()
        self.model.optimizer.learning_rate.assign(new_lr)
        print(f"Epoch {self.current_epoch}: Learning Rate = {new_lr:.6f}")

    def compute_lr(self):
        if self.current_epoch <= self.warmup_epochs:
            # Linear Warmup: Increase LR linearly to peak_lr
            return (self.peak_lr / self.warmup_epochs) * self.current_epoch
        else:
            # Cosine Annealing
            progress = (self.current_epoch - self.warmup_epochs) / (self.total_epochs - self.warmup_epochs)
            return 0.5 * self.peak_lr * (1 + np.cos(np.pi * progress))

Model checkpointing for each epoch

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint

# Define the checkpoint callback
checkpoint_callback = ModelCheckpoint(
    filepath="model_checkpoint_epoch_{epoch:02d}.keras",  # Save model after each epoch
    save_weights_only=False,  # Set to True if you only want to save weights
    save_best_only=False,  # Set to True to save only the best model based on validation loss
    verbose=1
)

Building the model through transfer learning with EfficientNetV2B0, applying flip and cutout augmentations, followed by our own output layer. The layers are connected through the Keras functional API. We also use an early stopping callback. Only 16 epochs were set here, but 30-40 epochs would be better for this model. With 16 epochs it reached an accuracy of around 18%.

This is pretty good considering that there are around 184 classes to choose from, so these predictions are nowhere near random.
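
As a rough sanity check on that claim, the accuracy can be compared with the chance level. This is a back-of-the-envelope sketch that assumes uniform random guessing over balanced classes, which the real dataset may not satisfy; `approx_classes` is the approximate count quoted above.

```python
approx_classes = 184            # approximate class count quoted above
chance = 1 / approx_classes     # expected accuracy of uniform random guessing
observed = 0.18                 # accuracy reported after 16 epochs

ratio = observed / chance
print(f"chance level: {chance:.2%}, observed: {observed:.0%}, ratio: {ratio:.1f}x")
```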

In [None]:
import tensorflow as tf
import keras_cv
from tensorflow.keras import layers
from tensorflow.keras.applications.efficientnet_v2 import EfficientNetV2B0, preprocess_input


# Ensure bird_classes is defined
num_classes = len(bird_classes)

# Load base model
base_model = EfficientNetV2B0(include_top=False)
base_model.trainable = False  # Freeze for feature extraction

# Define Model
input_shape = (256, 256, 3)
inputs = layers.Input(shape=input_shape, name="input_layer")
x = tf.keras.layers.RandomFlip(mode="horizontal")(inputs)  # Horizontal Flip
x = keras_cv.layers.RandomCutout(height_factor=0.2, width_factor=0.2)(x)
x = base_model(x, training=False)  # Keep batchnorm frozen
x = layers.GlobalAveragePooling2D(name="global_average_pooling_layer")(x)
outputs = layers.Dense(num_classes, activation="softmax", name="output_layer")(x)
model = tf.keras.Model(inputs, outputs)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

total_epochs = 16
warmup_epochs = 5
peak_lr = 1e-4
cosine_warmup_callback = CosineAnnealingWithWarmup(total_epochs, warmup_epochs, peak_lr)

# Early stopping callback (Accuracy monitoring)
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=7, mode="max", restore_best_weights=True
)

'''
# batch_size is not passed because the datasets are already batched
model.fit(
    training_data,
    validation_data=testing_data,
    epochs=total_epochs,
    callbacks=[cosine_warmup_callback, early_stopping, checkpoint_callback]
)
'''

Here we apply fine-tuning: the top 10 layers of the base model are unfrozen by setting their trainable property to True. This gives the model more trainable weights and therefore more capacity to adapt to the data, which appears to help: within 5 epochs the val_accuracy went up to 36%.

This is the code to make the last 10 layers of the base efficientnet trainable:

In [None]:
'''
# Unfreeze the whole base model, then re-freeze everything except the last 10 layers
base_model.trainable = True
for layer in base_model.layers[:-10]:
    layer.trainable = False
'''

Here I loaded one of my previously fine-tuned models for further training. If you want to unfreeze the layers yourself, execute the code above and skip the cell below.

In [None]:
import tensorflow as tf
model = tf.keras.models.load_model(
    '/kaggle/input/best-model/model_checkpoint_epochft_02.keras'
)

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint
checkpoint_callback = ModelCheckpoint(
    filepath="model_checkpoint_epochft_{epoch:02d}.keras",  # Save model after each epoch
    save_weights_only=False,  # Set to True if you only want to save weights
    save_best_only=False,  # Set to True to save only the best model based on validation loss
    verbose=1
)
model.compile(loss="sparse_categorical_crossentropy",
                optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), # same value as the earlier peak LR (1e-4)
                metrics=["accuracy"])


'''
# batch_size is not passed because the datasets are already batched
model.fit(
    training_data,
    validation_data=testing_data,
    epochs=5,
    callbacks=[checkpoint_callback]
)
'''

In [None]:
fin_model = tf.keras.models.load_model(
    '/kaggle/input/best-weight-36/model_checkpoint_epochft_05.keras'
)
# Predict on the first test batch with the loaded fine-tuned model
for i, o in testing_data.take(1):
    for audio, label in zip(i, o):
        res = fin_model.predict(np.expand_dims(audio, axis=0))
        p_index = np.argmax(res)
        print('predicted result: ', bird_classes[p_index])
        print('actual result: ', bird_classes[int(label)])