# CLAP Demo: Sound-to-Sound and Text-to-Sound Similarity and Classification

This notebook demonstrates how to use CLAP (Contrastive Language-Audio Pre-training) to:
1. Find similar sounds in the ESC-50 dataset
2. Find sounds relevant for a text description
2. Perform zero-shot audio classification

CLAP learns joint representations of audio and text, enabling powerful audio understanding tasks.

## Setup and Installation

In [None]:
# NOTE: you need to have ffmpeg<8 installed with the dynamic library
# libavutil included for this to work
import torchcodec

import pandas as pd
import torch
import numpy as np
import librosa
from transformers import ClapModel, ClapProcessor
from datasets import load_dataset
from sklearn.metrics.pairwise import cosine_similarity
import random
from IPython.display import Audio, display

In [None]:
# Udacity workspace only. Comment out if running locally. 
# Fix for huggingface datasets when dealing with
# read-only file-systems
import filelock
import contextlib

# Create a proper no-op FileLock class
class NoOpFileLock:
    def __init__(self, lock_file, *args, **kwargs):
        pass
    
    def __enter__(self):
        return self
    
    def __exit__(self, *args, **kwargs):
        pass

# Replace FileLock with a no-op context manager
filelock.FileLock = NoOpFileLock

## Load CLAP Model and ESC-50 Dataset

In [2]:
# Load CLAP model and processor
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Load ESC-50 dataset
dataset = load_dataset("ashraq/esc50", split="train")
print(len(dataset))
print(f"Loaded {len(dataset)} audio samples")

# Show available categories
pd.Series(sorted(set(dataset["category"]))).to_frame("categories")

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Repo card metadata block was not found. Setting CardData to empty.


2000
Loaded 2000 audio samples


Unnamed: 0,categories
0,airplane
1,breathing
2,brushing_teeth
3,can_opening
4,car_horn
5,cat
6,chainsaw
7,chirping_birds
8,church_bells
9,clapping


## Helper Functions

In [4]:
def get_audio_embedding(audio_array, sample_rate=22050):
    """Get CLAP embedding for an audio sample."""

    inputs = processor(audios=audio_array, sampling_rate=48000, return_tensors="pt")
    
    with torch.no_grad():
        audio_embed = model.get_audio_features(**inputs)
    
    return audio_embed.numpy()


def get_text_embedding(text):
    """Get CLAP embedding for text."""
    inputs = processor(text=text, return_tensors="pt")
    with torch.no_grad():
        text_embed = model.get_text_features(**inputs)
    return text_embed.numpy()

## Pre-compute embeddings

First we need to compute the embeddings for each sound:

In [None]:
def process_audio_batch(batch):
    # Extract arrays from the audio column
    audio_arrays = [audio['array'] for audio in batch]
    
    return {
        'embedding': get_audio_embedding(audio_arrays, sample_rate=22050)
    }

# Process in batches
processed_dataset = dataset.map(
    process_audio_batch,
    batched=True,
    batch_size=32,
    input_columns=['audio'],
    cache_file_name="/tmp/processed_esc1.arrow"
)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

## Finding Similar Sounds

Here we show how to search for sounds similar to a query sound:

In [6]:
# Select a random sample as our query

random.seed(999)  # For reproducibility
query_idx = random.randint(0, len(dataset) - 1)
query_sample = dataset[query_idx]
query_audio = query_sample["audio"]["array"]
query_sr = query_sample["audio"]["sampling_rate"]

print(f"Query sound: {query_sample['filename']} (Category: {query_sample['category']})")
display(
    Audio(
        data=query_audio,
        autoplay=False,
        rate=query_sr,
    )
)

# Get embedding for query audio
query_embedding = get_audio_embedding(query_audio, query_sr)

Query sound: 5-103415-A-2.wav (Category: pig)


Now we need to compute the cosine similarity of this embedding with all other embeddings:

$$
\text{cosine similarity} = \frac{A \times B^T}{||A||~||B||} 
$$

In [9]:
# Normalize the query embedding
A = query_embedding / np.linalg.norm(query_embedding, axis=1, keepdims=True)

# Normalize all dataset embeddings
B_ = np.vstack(processed_dataset["embedding"])
B = B_ / np.linalg.norm(B_, axis=1, keepdims=True)

# Compute cosine similarities
# Using dot product since embeddings are normalized
# NOTE: could also use sklearn.metrics.pairwise.cosine_similarity
# Also, query_embedding is 2D (1, dim), so this is matrix multiplication
similarities = (
    np.matmul(A, B.T)
).squeeze()

# Add the similarity scores to the dataset and sort
ds_with_similarity = processed_dataset.add_column("similarity", similarities).sort(
    "similarity", reverse=True
)

Finally, let's hear what we found:

In [10]:
#  Display top 3 most similar sounds
print("\nTop 3 most similar sounds:")
# NOTE: we exclude the first one because it's the query itself
# (since the sound we chose as query is part of the dataset)
for row in ds_with_similarity.select(range(1, 4)):
    print(f"{row['filename']} ({row['category']}) - Similarity: {row['similarity']:.3f}")
    display(
        Audio(
            data=row['audio']['array'],
            autoplay=False,
            rate=row['audio']['sampling_rate'],
        )
    )


Top 3 most similar sounds:
3-253084-A-2.wav (pig) - Similarity: 0.883


1-260640-B-2.wav (pig) - Similarity: 0.882


5-103421-A-2.wav (pig) - Similarity: 0.878


## Text-to-Audio Search

Finally, we show how to search audio using text:

In [30]:
# Search for audio using text descriptions
search_query = "A hike in the woods"
query_text_embedding = get_text_embedding(search_query)

print(f"Searching for: '{search_query}'")
print("-" * 30)

# Here we use instead the sklearn function for variety
similarities = cosine_similarity(
    query_text_embedding,
    np.vstack(processed_dataset["embedding"])
).squeeze()

# Add the similarity scores to the dataset and sort
ds_with_similarity = processed_dataset.add_column(
    "similarity", 
    similarities
).sort("similarity", reverse=True)

for row in ds_with_similarity.select(range(1, 4)):
    print(f"{row['filename']} ({row['category']}) - Similarity: {row['similarity']:.3f}")
    display(
        Audio(
            data=row['audio']['array'],
            autoplay=False,
            rate=row['audio']['sampling_rate'],
        )
    )

Searching for: 'A hike in the woods'
------------------------------
3-94344-A-25.wav (footsteps) - Similarity: 0.334


3-103599-A-25.wav (footsteps) - Similarity: 0.326


3-103599-B-25.wav (footsteps) - Similarity: 0.325


## Zero-Shot Classification

Here we show how we can classify arbitrary sounds with arbitrary labels without retaining (zero-shot).

Let's load some sample sounds:

In [33]:
import glob
from pathlib import Path

sounds = Path("sounds").glob("*.mp3")
ground_truth = [x.stem for x in sounds]
audio_arrays = [librosa.load(f, sr=48000)[0] for f in glob.glob("sounds/*.mp3")]

for audio, gt in zip(audio_arrays, ground_truth):
    print(f"Ground truth: {gt}")
    display(
        Audio(
            data=audio,
            autoplay=False,
            rate=48000,
        )
    )

Ground truth: jackhammer


Ground truth: angle-grinder


Ground truth: power-drill


And now let's use a HF transformers `pipeline` to do the classification:

In [34]:
from transformers import pipeline


# Define some category descriptions
category_descriptions = ["drill", "angle grinder", "jackhammer"]

audio_classifier = pipeline(
    task="zero-shot-audio-classification", model="laion/clap-htsat-unfused"
)
output = audio_classifier(audio_arrays, candidate_labels=category_descriptions, top_k=1)

# Let's display the results using pandas
df = pd.DataFrame.from_records([{x["label"]: x["score"] for x in y} for y in output])
df["winning_label"] = df.idxmax(axis=1)
df["ground_truth"] = ground_truth
df

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Device set to use mps:0


Unnamed: 0,jackhammer,angle grinder,drill,winning_label,ground_truth
0,0.844018,0.149744,0.006238,jackhammer,jackhammer
1,0.000938,0.997089,0.001973,angle grinder,angle-grinder
2,0.00012,0.002454,0.997426,drill,power-drill


## Key Takeaways

CLAP enables powerful audio-text understanding by learning shared representations:

1. **Sound Similarity**: Find acoustically similar audio clips without manual feature engineering
2. **Zero-Shot Classification**: Classify audio using natural language descriptions
3. **Text-to-Audio Search**: Search audio databases using text queries

This demonstrates the power of multimodal models for audio understanding tasks.