**(CD Methods for FSD50K dataset)**
# Audio Classification with Sound Descriptions

This notebook reproduces results for the ESC-50 and FSD50K datasets shown in the paper **"A SOUND DESCRIPTION: EXPLORING PROMPT TEMPLATES AND CLASS DESCRIPTIONS TO ENHANCE ZERO-SHOT AUDIO CLASSIFICATION"** by Michel Olvera, Paraskevas Stamatiadis and Slim Essid.

## Overview

The experiment evaluates different class description methods for zero-shot audio classification using CLAP (Contrastive Language-Audio Pre-training) models. The notebook compares:

- **CLS**: Standard class names
- **Context**: Contextual descriptions
- **Ontology**: Ontology-based descriptions  
- **Base**: Basic descriptions
- **Dictionary**: Dictionary-style descriptions

## Methodology

The evaluation uses a cross-validation approach where for each class, the system selects the best-performing description type (CLS vs. definition-based) during training, then applies this mapping to test data. Performance is measured using class-wise accuracy for single-label datasets or mean Average Precision (mAP) for multi-label datasets.


# Import libraries

In [71]:
import os
import argparse
import pandas as pd
from msclap import CLAP
import torch.nn.functional as F
import numpy as np
import pickle
import torch
import os
import argparse
import json
from torch.utils.data import DataLoader
from config import conf, common_parameters
from pprint import pprint
from utilities import merge_dicts
from metrics_helper import compute_metrics, compute_class_wise_accuracy
from sklearn.model_selection import KFold, StratifiedKFold

# Set experiment parameters

In [65]:
# conf_id = "001" # CD Methods for ESC50 Dataset
conf_id = "002" # CD Methods for FSD50K Dataset

conf = merge_dicts(common_parameters, conf[conf_id])
conf


{'job_id': '698390',
 'output_folder': '/tsi/audiosig/audible/dcase/studies/016_CLAP_prompting_with_descriptors/003_evaluate_prompts/results',
 'similarities_folder': '/tsi/audiosig/audible/dcase/studies/016_CLAP_prompting_with_descriptors/003_evaluate_prompts/similarities',
 'audio_embeddings_folder': '/tsi/audiosig/audible/dcase/studies/016_CLAP_prompting_with_descriptors/001_extract_audio_embeddings/embeddings',
 'text_embeddings_folder': '/tsi/audiosig/audible/dcase/studies/016_CLAP_prompting_with_descriptors/002_extract_text_embeddings/embeddings',
 'model_name': 'CLAP-MS-23',
 'definition_type': 'CLS',
 'test_dataset': 'FSD50K',
 'evaluation_mode': 'CLS'}

In [None]:
# Class-wise model selection through cross-validation

model_name = conf['model_name']
test_dataset = conf['test_dataset']
definition_type = conf['definition_type']
model_name = conf['model_name']
evaluation_mode = conf['evaluation_mode']

# Load dataset
audio_embeddings_path = os.path.join(conf['audio_embeddings_folder'],
                                        model_name,
                                        test_dataset + '.pt')

if test_dataset != 'TUT2017':
    definition_types = ['CLS', 'context', 'ontology', 'base', 'dictionary']
else:
    definition_types = ['CLS', 'ontology', 'base', 'dictionary']


# Load text embeddings for each definition type
text_embeddings_paths = []
for definition_type in definition_types:
    text_embeddings_paths.append(os.path.join(conf['text_embeddings_folder'],
                                        'CLAP-MS-23',
                                        test_dataset + '_' + definition_type + '.pkl'))
    
# Load CLAP model
if model_name == 'CLAP-MS-23':
    clap_model = CLAP(version = '2023', use_cuda=True)
else:
    raise ValueError('Please specify a valid model')

In [67]:
# Read embeddings
audio_embeddings = torch.load(audio_embeddings_path)
print("Audio embeddings shape: ", audio_embeddings.shape)

# Read ground-truth labels
labels = torch.load(audio_embeddings_path.replace('.pt', '_labels.pt'))
print("Labels shape: ", labels.shape)

# Labels are one-hot encoded. Convert them to integers
if test_dataset == 'FSD50K' or test_dataset == 'AudioSet' or test_dataset == 'DCASE2017':
    labels_1D = labels.detach().cpu().numpy()
else:
    labels_1D = torch.argmax(labels, dim=1) 

# Read text embeddings dictionaries
prompts_dictionary_list = []
for text_embeddings_path in text_embeddings_paths:
    with open(text_embeddings_path, 'rb') as f:
        prompts_dictionary_list.append(pickle.load(f))

# Select the text embeddings from the key '' in the dictionaries
text_embeddings_list = []
for prompts_dictionary in prompts_dictionary_list:
    text_embeddings_list.append(prompts_dictionary['']['embeddings'])



Audio embeddings shape:  torch.Size([51197, 1024])
Labels shape:  torch.Size([51197, 200])


## Cross-validation Setup

In [68]:
# Cross-validation. Generate deterministic folds of audio embeddings, text embeddings and labels
# using scikit-learn's StratifiedKFold

# Print shape of data
print("Audio embeddings shape: ", audio_embeddings.shape)
print("Labels shape: ", labels.shape)
print("Text embeddings", text_embeddings_list[0].shape)

audio_embeddings_train_folds = []
labels_train_folds = []

audio_embeddings_test_folds = []
labels_test_folds = []

# Define the number of folds
n_folds = 5

# Define the seed for reproducibility
seed = 42

# Define the stratified k-fold object
if test_dataset == 'FSD50K' or test_dataset == 'AudioSet' or test_dataset == 'DCASE2017':
    # Cross validation suitable for multi-label classification
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)

else:
    # Cross validation suitable for single-label classification
    # StratifiedKFold is used to ensure that the proportion of classes is the same in each fold
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)


# Generate the folds
for train_index, test_index in kf.split(audio_embeddings, labels_1D):

    # print("TRAIN:", train_index, "TEST:", test_index)

    audio_embeddings_train_folds.append(audio_embeddings[train_index])
    labels_train_folds.append(labels[train_index])

    audio_embeddings_test_folds.append(audio_embeddings[test_index])
    labels_test_folds.append(labels[test_index])
    

Audio embeddings shape:  torch.Size([51197, 1024])
Labels shape:  torch.Size([51197, 200])
Text embeddings torch.Size([200, 1024])


# Evaluation

In [69]:
results_df = pd.DataFrame()

if evaluation_mode == 'CLS':

    # We'll take the max accuracy/mAP between CLS and the other definition types

    for fold in range(n_folds):
        print("\nFold: ", fold)

        metric_results = [] # List to store the dictionary of results of each definition type

        for text_embeddings in text_embeddings_list:
            # print("Text embeddings shape: ", text_embeddings.shape)
            # Compute similarity
            y_labels = labels_train_folds[fold].detach().cpu().numpy()
            similarity = clap_model.compute_similarity(audio_embeddings_train_folds[fold], text_embeddings)

            if test_dataset == 'FSD50K' or test_dataset == 'AudioSet' or test_dataset == 'DCASE2017':
                # Process similarities and compute mAP
                y_pred = similarity.detach().cpu().numpy()

                _, _, class_wise_metrics_dict = compute_metrics(y_labels, y_pred, normalize_scores=False)
                # metric_result = sum(ap.values())/len(ap.values())
                # print('mAP: {}'.format(metric_result))
            
            else:

                # Process similarities and compute accuracy
                y_pred = F.softmax(similarity, dim=1).detach().cpu().numpy()
                
                class_wise_metrics_dict = compute_class_wise_accuracy(
                    np.argmax(y_labels, axis=1),
                    np.argmax(y_pred, axis=1))
                
                # print('Accuracy: {}'.format(metric_result))
            
            metric_results.append(class_wise_metrics_dict)


        # # Take the max accuracy/mAP between CLS and the other definition types
        CLS_metrics_dict = metric_results[0]
        definitions_metrics_dicts = metric_results[1:]

        # Model-class mapping
        mapping = dict() # Will store which predictor performs best for each class

        for i, definition_dict in enumerate(definitions_metrics_dicts):
            # Definition type index
            print("\nDefinition type: ", definition_types[i+1])
            metrics_dict = CLS_metrics_dict.copy()
            for class_index in definition_dict.keys():
                initial_accuracy = CLS_metrics_dict[class_index]
                new_accuracy = 0
                if definition_dict[class_index] > CLS_metrics_dict[class_index]:
                    metrics_dict[class_index] = definition_dict[class_index]
                    new_accuracy = definition_dict[class_index]
                # Construct mapping
                if initial_accuracy < new_accuracy:
                    mapping[class_index] = 'DEF' # If definition performs better
                else:
                    mapping[class_index] = 'CLS' # If CLS performs better

            # print("Mapping: ", mapping)

                    
            metric_result_training = sum(metrics_dict.values()) / len(metrics_dict.values())
            print('Training accuracy: {}'.format(metric_result_training))


            test_cval = True

            if test_cval:
                # Test the mapping in the test set of the fold
                metric_results = [] # List to store the dictionary of results of each definition type

                text_embeddings_test_list = [text_embeddings_list[0], text_embeddings_list[i+1]]

                for text_embeddings in text_embeddings_test_list:
                    # print("Text embeddings shape: ", text_embeddings.shape)
                    # Compute similarity
                    y_labels = labels_test_folds[fold].detach().cpu().numpy()

                    similarity = clap_model.compute_similarity(audio_embeddings_test_folds[fold], text_embeddings)

                    if test_dataset == 'FSD50K' or test_dataset == 'AudioSet' or test_dataset == 'DCASE2017':
                        # Process similarities and compute mAP
                        y_pred = similarity.detach().cpu().numpy()

                        _, _, class_wise_metrics_dict = compute_metrics(y_labels, y_pred, normalize_scores=False)
                    
                    else:

                        # Process similarities and compute accuracy
                        y_pred = F.softmax(similarity, dim=1).detach().cpu().numpy()
                        
                        class_wise_metrics_dict = compute_class_wise_accuracy(
                            np.argmax(y_labels, axis=1),
                            np.argmax(y_pred, axis=1))
                        
                        # print('Accuracy: {}'.format(metric_result))
                    
                    metric_results.append(class_wise_metrics_dict)


                # # Take the max accuracy/mAP between CLS and the other definition types
                CLS_metrics_dict = metric_results[0]
                definitions_metrics_dicts = metric_results[1:]


                for definition_dict in definitions_metrics_dicts:
                    # Definition type index
                    metrics_dict = CLS_metrics_dict.copy()
                    for class_index in definition_dict.keys():
                        # Use the mapping to select the best predictor for each class
                        if mapping[class_index] == 'DEF':
                            metrics_dict[class_index] = definition_dict[class_index]

                    metric_result_test = sum(metrics_dict.values()) / len(metrics_dict.values())
                    print('Test Accuracy: {}'.format(metric_result_test))
                    print('---'*10)

                    # Store the results in a dictionary
                    results = pd.DataFrame({
                        'model_name': [model_name],
                        'test_dataset': [test_dataset],
                        'fold': [fold],
                        'definition_type': [definition_types[i+1]],
                        'training_result': [metric_result_training],
                        'test_result': [metric_result_test]
                    })
                    results_df = pd.concat([results_df, results], ignore_index=True)
                    


Fold:  0

Definition type:  context
Training accuracy: 0.5084328638249755
Test Accuracy: 0.5093378990053359
------------------------------

Definition type:  ontology
Training accuracy: 0.509634043440286
Test Accuracy: 0.5010220721930971
------------------------------

Definition type:  base
Training accuracy: 0.5096588704333738
Test Accuracy: 0.4974945917069452
------------------------------

Definition type:  dictionary
Training accuracy: 0.4994131406595965
Test Accuracy: 0.4919644601663952
------------------------------

Fold:  1

Definition type:  context
Training accuracy: 0.5061936609866228
Test Accuracy: 0.5181978362066106
------------------------------

Definition type:  ontology
Training accuracy: 0.5138439861445
Test Accuracy: 0.5122187561968957
------------------------------

Definition type:  base
Training accuracy: 0.5131422476210318
Test Accuracy: 0.5065613403609018
------------------------------

Definition type:  dictionary
Training accuracy: 0.5044025527028975
Test Ac

# Compute final results

In [70]:
# Compute final results across folds from results dataframe
definition_types_as_in_paper = ['CLS', 'dictionary', 'base', 'context', 'ontology']

for definition_type in definition_types_as_in_paper[1:]:

    # Filter results for the current definition type
    results_df_def = results_df[results_df['definition_type'] == definition_type]

    # Compute mean and std of training and test results across folds
    training_mean = results_df_def['training_result'].mean()
    training_std = results_df_def['training_result'].std()

    test_mean = results_df_def['test_result'].mean()
    test_std = results_df_def['test_result'].std()

    print("\nDefinition type: ", definition_type)
    print("Training result: {:.4f} +/- {:.4f}".format(training_mean, training_std))
    print("Test result: {:.4f} +/- {:.4f}".format(test_mean, test_std))


Definition type:  dictionary
Training result: 0.5018 +/- 0.0020
Test result: 0.4972 +/- 0.0034

Definition type:  base
Training result: 0.5102 +/- 0.0018
Test result: 0.5039 +/- 0.0038

Definition type:  context
Training result: 0.5074 +/- 0.0009
Test result: 0.5128 +/- 0.0033

Definition type:  ontology
Training result: 0.5128 +/- 0.0019
Test result: 0.5074 +/- 0.0042
