# Audio Clustering with ImageBind Multimodal Embeddings

This notebook demonstrates audio clustering using Meta's ImageBind model, a multimodal embedding model that maps images, text, audio, and other modalities into a shared embedding space. We'll cluster environmental sound recordings using its audio embeddings.

**Objective**: Apply multimodal embeddings to cluster audio data, and examine how far ImageBind's unified embedding space supports unsupervised discovery of acoustic patterns and sound categories.

**Note**: A GPU runtime is strongly recommended. The code falls back to CPU, but embedding extraction will be much slower. In Colab, go to Runtime → Change runtime type → GPU.

In [None]:
# Installing required libraries
!pip install torch torchvision torchaudio transformers scikit-learn umap-learn matplotlib seaborn pandas librosa soundfile -q
!pip install git+https://github.com/facebookresearch/ImageBind.git -q

# Importing necessary libraries
import torch
import torchaudio
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import librosa
import librosa.display
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score, confusion_matrix
import umap
import os
import zipfile
import warnings
warnings.filterwarnings('ignore')

# Setting style
sns.set_style("whitegrid")
np.random.seed(42)
torch.manual_seed(42)

# Checking GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Downloading ESC-50 Dataset

This cell downloads the ESC-50 (Environmental Sound Classification) dataset, which contains 2,000 five-second audio recordings across 50 sound categories (40 clips per category). We'll use a subset of 10 diverse categories for clustering.

**Note**: First download may take 3-5 minutes (~600MB).

In [None]:
# Downloading ESC-50 dataset
print("Downloading ESC-50 dataset...")
print("This may take 3-5 minutes for first download.")

!wget -q https://github.com/karoldvl/ESC-50/archive/master.zip -O esc50.zip
print("Download complete! Extracting...")

# Extracting the dataset
with zipfile.ZipFile('esc50.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

print("Extraction complete!")

# Setting paths
audio_path = 'ESC-50-master/audio/'
meta_path = 'ESC-50-master/meta/esc50.csv'

# Loading metadata
metadata = pd.read_csv(meta_path)

print(f"\nDataset loaded successfully!")
print(f"Total audio files: {len(metadata)}")
print(f"Number of categories: {metadata['target'].nunique()}")
print(f"\nFirst few rows of metadata:")
print(metadata.head())

## Selecting Diverse Sound Categories

Selecting 10 diverse sound categories from different domains (animals, nature, indoor, urban) for clustering analysis.

In [None]:
# Selecting 10 diverse categories for clustering
selected_categories = [
    'dog',           # Animals
    'rooster',       # Animals
    'rain',          # Nature
    'sea_waves',     # Nature
    'crackling_fire',# Nature
    'clock_tick',    # Indoor
    'keyboard_typing', # Indoor
    'door_wood_knock', # Indoor
    'car_horn',      # Urban
    'siren'          # Urban
]

# Filtering metadata for selected categories
filtered_metadata = metadata[metadata['category'].isin(selected_categories)].copy()

# Balancing classes by taking the first 20 clips per category (of 40 available) for faster processing
samples_per_category = 20
sampled_metadata = filtered_metadata.groupby('category').head(samples_per_category).reset_index(drop=True)

print(f"Selected categories: {len(selected_categories)}")
print(f"Total audio samples: {len(sampled_metadata)}")
print(f"\nCategory distribution:")
print(sampled_metadata['category'].value_counts().sort_index())

# Creating category to ID mapping
category_to_id = {cat: idx for idx, cat in enumerate(sorted(selected_categories))}
sampled_metadata['category_id'] = sampled_metadata['category'].map(category_to_id)

print(f"\nCategory ID mapping:")
for cat, idx in sorted(category_to_id.items(), key=lambda x: x[1]):
    print(f"{idx}: {cat}")
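
The cell above takes the first 20 rows of each category with `head()`, which is deterministic but not random. If a random balanced subset is preferred, pandas' `groupby(...).sample(...)` is one option. The sketch below uses a hypothetical `toy_metadata` frame in place of the real ESC-50 metadata; the file names and category sizes are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the ESC-50 metadata: 3 categories, 40 clips each.
toy_metadata = pd.DataFrame({
    "filename": [f"clip_{i}.wav" for i in range(120)],
    "category": ["dog"] * 40 + ["rain"] * 40 + ["siren"] * 40,
})

samples_per_category = 20

# Draw a reproducible random sample of equal size from every category.
balanced = (
    toy_metadata
    .groupby("category")
    .sample(n=samples_per_category, random_state=42)
    .reset_index(drop=True)
)

print(balanced["category"].value_counts().sort_index())
```

Fixing `random_state` keeps the subset identical across runs, which matters when comparing clustering metrics between experiments.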

## Visualizing Sample Audio Spectrograms

Displaying spectrograms (visual representations of audio frequencies over time) for sample sounds from each category to understand the acoustic patterns.

In [None]:
# Visualizing sample spectrograms from each category
fig, axes = plt.subplots(2, 5, figsize=(18, 8))
axes = axes.ravel()

for idx, category in enumerate(sorted(selected_categories)):
    # Getting first audio file for this category
    sample = sampled_metadata[sampled_metadata['category'] == category].iloc[0]
    audio_file = os.path.join(audio_path, sample['filename'])

    # Loading and processing audio
    y, sr = librosa.load(audio_file, sr=22050, duration=5.0)

    # Computing mel spectrogram
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    S_db = librosa.power_to_db(S, ref=np.max)

    # Plotting
    librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel', ax=axes[idx])
    axes[idx].set_title(category.replace('_', ' ').title(), fontsize=10)
    axes[idx].set_xlabel('Time (s)', fontsize=8)
    axes[idx].set_ylabel('Frequency (Hz)', fontsize=8)

plt.suptitle('Sample Spectrograms from Each Sound Category', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Spectrograms show frequency content over time for each sound type.")
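
To make the decibel conversion in the plotting cell less opaque, here is a NumPy-only sketch of what `librosa.power_to_db(S, ref=np.max)` computes with its default `amin=1e-10` and `top_db=80.0`. The helper name `power_to_db_sketch` and the random `toy_power` array are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def power_to_db_sketch(S, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to dB relative to its own maximum."""
    S = np.asarray(S, dtype=float)
    ref = S.max()
    # 10 * log10(S / ref), with a floor at amin to avoid log(0).
    log_spec = 10.0 * np.log10(np.maximum(S, amin))
    log_spec -= 10.0 * np.log10(np.maximum(ref, amin))
    # Clip everything more than top_db below the peak.
    return np.maximum(log_spec, log_spec.max() - top_db)

rng = np.random.default_rng(0)
toy_power = rng.random((128, 50)) ** 4  # hypothetical 128-band power spectrogram
toy_db = power_to_db_sketch(toy_power)

print(f"dB range: {toy_db.min():.1f} to {toy_db.max():.1f}")
```

Because the reference is the spectrogram's own maximum, the loudest bin is always 0 dB and every plot shares the same 80 dB dynamic range, regardless of how loud the original recording was.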

## Loading ImageBind Model

Loading Meta's ImageBind model for generating multimodal embeddings. ImageBind learns joint representations that work across vision, audio, text, and other modalities.

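**Note**: The first run downloads several gigabytes of model weights (the `imagebind_huge` checkpoint is roughly 4.5 GB). This may take 5-10 minutes or longer, depending on your connection.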

In [None]:
# Loading ImageBind model
try:
    from imagebind import data
    from imagebind.models import imagebind_model
    from imagebind.models.imagebind_model import ModalityType

    print("Loading ImageBind model...")
    print("This may take several minutes on first run as the model is downloaded.")

    model = imagebind_model.imagebind_huge(pretrained=True)
    model.eval()
    model.to(device)

    print(f"\nImageBind model loaded successfully!")
    print(f"Model is on device: {next(model.parameters()).device}")
    USE_IMAGEBIND = True

except Exception as e:
    print(f"Error loading ImageBind: {e}")
    print("\nImageBind may not be available. This is expected if installation failed.")
    print("For this demo, we'll use a simple audio feature extraction method instead.")
    USE_IMAGEBIND = False

## Loading and Preprocessing Audio Files

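This cell collects the file paths and integer labels for the selected clips. The audio itself is read in the next step by ImageBind's `load_and_transform_audio_data`, which resamples each file to 16 kHz and converts it to mel spectrogram clips, as expected by the model.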

In [None]:
# Collecting audio file paths and labels
print("Collecting audio file paths...")

audio_files = []
labels = []

for idx, row in sampled_metadata.iterrows():
    audio_file = os.path.join(audio_path, row['filename'])
    audio_files.append(audio_file)
    labels.append(row['category_id'])

labels = np.array(labels)

print(f"Loaded {len(audio_files)} audio file paths")
print(f"Labels shape: {labels.shape}")
print(f"Unique labels: {np.unique(labels)}")

## Generating Audio Embeddings

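Extracting embeddings from the ImageBind model for all audio samples, processing in batches to keep GPU memory use manageable. If ImageBind failed to load, the cell falls back to 40-dimensional mean MFCC features, in which case all downstream results describe MFCCs rather than ImageBind embeddings.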

In [None]:
# Generating embeddings
if USE_IMAGEBIND:
    print("Generating embeddings with ImageBind...")
    print("This may take 3-5 minutes depending on GPU.")

    embeddings_list = []
    batch_size = 8  # Smaller batch for audio

    with torch.no_grad():
        for i in range(0, len(audio_files), batch_size):
            batch_files = audio_files[i:i+batch_size]

            # Loading audio using ImageBind's data loader
            inputs = {
                ModalityType.AUDIO: data.load_and_transform_audio_data(batch_files, device)
            }

            # Getting embeddings
            embeddings_batch = model(inputs)[ModalityType.AUDIO]
            embeddings_list.append(embeddings_batch.cpu())

            if (i // batch_size + 1) % 5 == 0:
                print(f"Processed {i + len(batch_files)}/{len(audio_files)} audio files...")

    embeddings = torch.cat(embeddings_list, dim=0).numpy()

else:
    # Fallback: Simple MFCC features
    print("Using MFCC features as fallback...")
    embeddings_list = []

    for i, audio_file in enumerate(audio_files):
        y, sr = librosa.load(audio_file, sr=22050, duration=5.0)
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
        mfccs_mean = np.mean(mfccs, axis=1)
        embeddings_list.append(mfccs_mean)

        if (i + 1) % 50 == 0:
            print(f"Processed {i+1}/{len(audio_files)} audio files...")

    embeddings = np.array(embeddings_list)

print(f"\nEmbeddings generated successfully!")
print(f"Embedding matrix shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")
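
K-Means groups points by Euclidean distance, whereas embedding similarity is often measured with dot products or cosine similarity. One optional preprocessing step, not used in this notebook, is to L2-normalize each embedding so that squared Euclidean distance becomes a simple function of cosine similarity. The sketch below uses a random `toy_embeddings` array as a placeholder; whether normalization changes the results on ESC-50 is not something this notebook establishes.

```python
import numpy as np

def l2_normalize(X, eps=1e-12):
    """Scale each row of X to unit Euclidean length."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(6, 16))  # placeholder for real embeddings

unit = l2_normalize(toy_embeddings)
cosine_sim = unit @ unit.T

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
sq_dist = ((unit[:, None, :] - unit[None, :, :]) ** 2).sum(axis=-1)

print("Max deviation from identity:", np.abs(sq_dist - (2 - 2 * cosine_sim)).max())
```

With unit-length rows, the nearest neighbor under Euclidean distance is also the nearest under cosine similarity, so K-Means behaves like an approximate cosine-based clustering.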

## Clustering Audio Embeddings

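Applying K-Means clustering to the audio embeddings with k=10, matching the number of selected categories, to group acoustically similar sounds together. The silhouette score is computed without labels, while the Adjusted Rand Index compares the clusters against the true categories.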

In [None]:
# Applying K-Means clustering
n_clusters = 10

print(f"Applying K-Means clustering with k={n_clusters}...")
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(embeddings)

# Calculating metrics
silhouette_avg = silhouette_score(embeddings, cluster_labels)
ari = adjusted_rand_score(labels, cluster_labels)

print(f"\n{'='*50}")
print(f"Audio Clustering Results")
print(f"{'='*50}")
print(f"Number of clusters: {n_clusters}")
print(f"Cluster distribution: {np.bincount(cluster_labels)}")
print(f"Silhouette Score: {silhouette_avg:.3f}")
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Inertia: {kmeans.inertia_:.2f}")
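
Here k is fixed at 10 because the number of categories is known. When it is not, one common heuristic is to sweep k and keep the value with the highest silhouette score. The sketch below runs that sweep on synthetic blobs from `make_blobs` (three well-separated groups chosen purely for illustration) rather than on the audio embeddings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three tight, well-separated groups (illustrative only).
toy_X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Fit K-Means for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    scores[k] = silhouette_score(toy_X, km.fit_predict(toy_X))

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"Best k by silhouette: {best_k}")
```

On real audio embeddings the silhouette curve is usually much flatter than on clean blobs, so the maximum is better treated as a hint than as a definitive answer.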

## Dimensionality Reduction with UMAP

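Reducing the high-dimensional audio embeddings to 2D for visualization. UMAP aims to preserve local neighborhood structure, so nearby points in the plot tend to be similar in embedding space, but distances between well-separated groups in the 2D projection should not be read quantitatively.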

In [None]:
# Reducing dimensions for visualization
print("Reducing dimensions with UMAP...")
reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=15, min_dist=0.1)
embeddings_2d = reducer.fit_transform(embeddings)

print(f"Dimensionality reduction complete!")
print(f"Original shape: {embeddings.shape}")
print(f"Reduced shape: {embeddings_2d.shape}")
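
Because UMAP is nonlinear and stochastic, a quick linear projection can serve as a cross-check on the overall layout. The sketch below implements a 2D PCA with NumPy's SVD; `pca_2d` is a hypothetical helper and `toy_embeddings` is random placeholder data, not the notebook's embeddings.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:2].T  # coordinates along the top-2 directions
    explained = (S[:2] ** 2) / (S ** 2).sum()  # variance ratio per component
    return coords, explained

rng = np.random.default_rng(42)
# Placeholder data whose variance decreases across the 32 dimensions.
toy_embeddings = rng.normal(size=(200, 32)) * np.linspace(3.0, 0.1, 32)

coords, explained = pca_2d(toy_embeddings)
print(f"Projected shape: {coords.shape}")
print(f"Explained variance ratio: {explained.round(3)}")
```

If groups that look separate under UMAP also separate under PCA, the structure is less likely to be an artifact of the UMAP settings.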

## Visualizing Audio Clustering Results

Plotting the clustered audio samples in 2D space, comparing predicted clusters with true sound categories.

In [None]:
# Visualizing clusters
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Subplot 1: Predicted clusters
scatter1 = axes[0].scatter(embeddings_2d[:, 0], embeddings_2d[:, 1],
                          c=cluster_labels, cmap='tab10',
                          alpha=0.6, edgecolors='k', s=50)
axes[0].set_title(f'Predicted Clusters (K-Means, k={n_clusters})\nSilhouette: {silhouette_avg:.3f}',
                 fontsize=14, fontweight='bold')
axes[0].set_xlabel('UMAP Dimension 1', fontsize=12)
axes[0].set_ylabel('UMAP Dimension 2', fontsize=12)
plt.colorbar(scatter1, ax=axes[0], label='Cluster ID')

# Subplot 2: True categories
scatter2 = axes[1].scatter(embeddings_2d[:, 0], embeddings_2d[:, 1],
                          c=labels, cmap='tab10',
                          alpha=0.6, edgecolors='k', s=50)
axes[1].set_title(f'True Sound Categories\nARI: {ari:.3f}',
                 fontsize=14, fontweight='bold')
axes[1].set_xlabel('UMAP Dimension 1', fontsize=12)
axes[1].set_ylabel('UMAP Dimension 2', fontsize=12)
cbar = plt.colorbar(scatter2, ax=axes[1], label='Category ID')
cbar.set_ticks(range(10))
cbar.set_ticklabels([cat.replace('_', ' ')[:10] for cat in sorted(selected_categories)],
                    fontsize=8, rotation=45)

plt.tight_layout()
plt.show()

## Confusion Matrix Analysis

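Examining how predicted clusters align with true sound categories to understand which sounds are acoustically similar and may be confused. Cluster IDs are arbitrary, so the matrix is not expected to be diagonal; a good clustering shows one dominant cell per row and per column.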

In [None]:
# Creating confusion matrix
cm = confusion_matrix(labels, cluster_labels)

plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=[f'C{i}' for i in range(n_clusters)],
            yticklabels=[cat.replace('_', ' ').title() for cat in sorted(selected_categories)])
plt.title(f'Confusion Matrix: True Categories vs Predicted Clusters\nARI: {ari:.3f}',
         fontsize=14, fontweight='bold')
plt.xlabel('Predicted Cluster', fontsize=12)
plt.ylabel('True Sound Category', fontsize=12)
plt.tight_layout()
plt.show()

# Analyzing cluster purity
print(f"\n{'='*60}")
print("CLUSTER PURITY ANALYSIS")
print(f"{'='*60}\n")

id_to_category = {v: k for k, v in category_to_id.items()}

for cluster_id in range(n_clusters):
    cluster_mask = cluster_labels == cluster_id
    cluster_true_labels = labels[cluster_mask]

    if len(cluster_true_labels) > 0:
        dominant_class = np.bincount(cluster_true_labels).argmax()
        purity = (cluster_true_labels == dominant_class).sum() / len(cluster_true_labels)

        print(f"Cluster {cluster_id}:")
        print(f"  Size: {len(cluster_true_labels)}")
        print(f"  Dominant category: {id_to_category[dominant_class]}")
        print(f"  Purity: {purity:.1%}")
        print()
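
Since cluster IDs are arbitrary, a single accuracy-style number requires first matching clusters to categories. One standard approach is the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`. The sketch below assumes a square matrix with true categories as rows and clusters as columns; `toy_cm` and the helper name are made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters_to_categories(cm):
    """Find the one-to-one cluster/category matching with maximum agreement."""
    # linear_sum_assignment minimizes cost, so negate the counts to maximize.
    row_ind, col_ind = linear_sum_assignment(-cm)
    mapping = {int(c): int(r) for r, c in zip(row_ind, col_ind)}  # cluster -> category
    accuracy = cm[row_ind, col_ind].sum() / cm.sum()
    return mapping, accuracy

# Hypothetical 3x3 confusion matrix (rows: true category, columns: cluster).
toy_cm = np.array([
    [0, 18, 2],
    [19, 0, 1],
    [3, 2, 15],
])

mapping, accuracy = match_clusters_to_categories(toy_cm)
print(f"Cluster -> category mapping: {mapping}")
print(f"Matched accuracy: {accuracy:.3f}")
```

Unlike per-cluster purity, this matching forces each category to be claimed by at most one cluster, so two clusters dominated by the same category cannot both count as correct.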

## Analyzing Acoustic Similarities

Identifying which sound categories are most acoustically similar based on clustering results.

In [None]:
# Finding most confused category pairs
print(f"\n{'='*60}")
print("ACOUSTIC SIMILARITY ANALYSIS")
print(f"{'='*60}\n")

print("Category pairs frequently grouped together (acoustically similar):\n")

confusion_pairs = []
for i in range(len(cm)):
    for j in range(len(cm[0])):
        if cm[i][j] >= 5:  # At least 5 samples grouped together
            confusion_pairs.append((id_to_category[i], j, cm[i][j]))

# Grouping by cluster
cluster_groups = {}
for cat, clust, count in confusion_pairs:
    if clust not in cluster_groups:
        cluster_groups[clust] = []
    cluster_groups[clust].append((cat, count))

for clust_id, categories in sorted(cluster_groups.items()):
    if len(categories) > 1:
        cat_names = [f"{cat} ({cnt})" for cat, cnt in categories]
        print(f"Cluster {clust_id}: {', '.join(cat_names)}")

print(f"\n{'='*60}")
print(f"Audio clustering with {'ImageBind' if USE_IMAGEBIND else 'MFCC fallback'} embeddings complete!")
print(f"Achieved Silhouette Score: {silhouette_avg:.3f}")
print(f"Achieved Adjusted Rand Index: {ari:.3f}")
print(f"{'='*60}")
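
The analysis above infers similarity indirectly, from which categories share a cluster. A more direct alternative is to average the embeddings of each category and compare the resulting centroids with cosine similarity. The sketch below uses synthetic 3D data in which categories 0 and 1 are deliberately placed close together; the function name and data are illustrative assumptions.

```python
import numpy as np

def category_centroid_similarity(embeddings, labels):
    """Cosine similarity between the mean embedding of each category."""
    class_ids = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in class_ids])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    return class_ids, centroids @ centroids.T

rng = np.random.default_rng(0)
# Synthetic setup: categories 0 and 1 point in similar directions, 2 does not.
base = np.array([[1.0, 0.0, 0.0], [0.9, 0.3, 0.0], [0.0, 0.0, 1.0]])
toy_labels = np.repeat(np.arange(3), 20)
toy_embeddings = base[toy_labels] + 0.05 * rng.normal(size=(60, 3))

class_ids, sim = category_centroid_similarity(toy_embeddings, toy_labels)

# Mask the diagonal, then locate the most similar pair of distinct categories.
off_diag = sim.copy()
np.fill_diagonal(off_diag, -np.inf)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(f"Most similar pair: {class_ids[i]} and {class_ids[j]} (cosine {sim[i, j]:.3f})")
```

Applied to the notebook's `embeddings` and `labels`, the same function would give a 10x10 similarity matrix that does not depend on the K-Means result.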

## Results Interpretation

**Clustering Performance:**
- **Adjusted Rand Index (ARI):** Measures agreement between predicted clusters and true categories, corrected for chance: 1.0 is a perfect match and values near 0 are what random assignment would produce. As a rough rule of thumb for this task, scores above about 0.4 suggest that the embeddings separate most categories, since many sounds have overlapping spectral characteristics (e.g., dog bark vs. rooster crow both have sharp attacks; rain vs. waves both have continuous noise).
- **Silhouette Score:** Ranges from -1 to 1 and indicates how well separated the clusters are in embedding space. Moderate scores (roughly 0.1-0.3) are plausible for audio clustering because of natural acoustic overlap between categories; environmental sounds lie on a continuum rather than in discrete buckets.
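
A small illustration of how ARI behaves may help when reading the scores. It ignores the actual cluster IDs and only looks at which samples are grouped together. The label lists below are made up for the example.

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same grouping, but with the cluster IDs renamed.
renamed = [2, 2, 2, 0, 0, 0, 1, 1, 1]
# Degenerate clustering that puts every sample in one cluster.
single_cluster = [0] * 9

ari_renamed = adjusted_rand_score(true_labels, renamed)
ari_single = adjusted_rand_score(true_labels, single_cluster)

print(f"ARI with renamed cluster IDs: {ari_renamed:.3f}")
print(f"ARI with a single cluster:    {ari_single:.3f}")
```

This is why ARI can be computed directly from `labels` and `cluster_labels` without first matching cluster IDs to category IDs.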

**Why Audio Clustering is Challenging:**
Audio data is inherently complex with overlapping features across categories. Factors like recording quality, background noise, and acoustic environment add variability. Some sounds share frequency patterns (keyboard typing vs. rain), temporal patterns (clock tick vs. water drops), or timbral qualities (fire crackling vs. sea waves).

**ImageBind's Multimodal Advantage:**
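ImageBind is trained to align several modalities (vision, audio, text, and others) in one embedding space, which may let its audio embeddings carry semantic information beyond raw acoustic features. This could pull together sounds from related sources, such as "dog" and "rooster" as animal sounds or "rain" and "sea waves" as water sounds, so the embeddings may reflect both acoustic and semantic similarity. The clustering results here are consistent with that idea but do not isolate the semantic effect from the acoustic one.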
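
One way to probe the cross-modal aspect would be to compare audio embeddings against text embeddings of the category names and apply a softmax over the similarities, in the style of the cross-modal example in the ImageBind README. The sketch below shows only the arithmetic: `text_emb` and `audio_emb` are synthetic placeholders standing in for real ImageBind outputs, and nothing here calls the model.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
category_names = ["dog", "rain", "siren"]

# Placeholder text embeddings: three orthogonal unit vectors in 8 dimensions.
text_emb = np.eye(3, 8)
# Placeholder audio embeddings: noisy copies of the matching text embedding.
true_ids = np.array([0, 0, 1, 2, 2])
audio_emb = text_emb[true_ids] + 0.1 * rng.normal(size=(5, 8))

# Rows: audio clips, columns: category names.
probs = softmax(audio_emb @ text_emb.T, axis=-1)
predicted = probs.argmax(axis=1)

for clip, p in enumerate(predicted):
    print(f"clip {clip}: predicted '{category_names[p]}'")
```

With real embeddings, a high match rate between `predicted` and the true categories would be evidence that the audio embeddings are aligned with the meaning of the category names, not just with each other.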

**Key Takeaway:**
Pretrained multimodal embeddings from ImageBind make it possible to cluster audio without any labeled training data. How closely the clusters match the true categories is given by the ARI and purity figures above. Where categories merge, the cause may be acoustic similarity (shared spectral patterns), semantic relationships learned through cross-modal training, or both.