~~~
Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
~~~

# Classifying sounds with HeAR and Wiki Commons Cough Data

<table><tbody><tr>
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/google-health/hear/blob/master/notebooks/train_data_efficient_classifier.ipynb">
      <img alt="Google Colab logo" src="https://www.tensorflow.org/images/colab_logo_32px.png" width="32px"><br> Run in Google Colab
    </a>
  </td>  
  <td style="text-align: center">
    <a href="https://github.com/google-health/hear/blob/master/notebooks/train_data_efficient_classifier.ipynb">
      <img alt="GitHub logo" src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" width="32px"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://huggingface.co/google/hear">
      <img alt="HuggingFace logo" src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" width="32px"><br> View on HuggingFace
    </a>
  </td>
</tr></tbody></table>


This Colab notebook demonstrates how to use the HeAR (Health Acoustic Representations) model, directly from Hugging Face, to create and utilize embeddings from health-related audio. The notebook focuses on building a data-efficient cough classifier system using a small [Wikimedia Commons](https://commons.wikimedia.org/wiki/Commons:Welcome) dataset of relevant sounds.

Embeddings are compact, numerical representations of audio data that capture important features, making them suitable for training machine learning models with limited data and computational resources. Learn more about embeddings and their benefits at [this page](https://developers.google.com/health-ai-developer-foundations/hear).

#### Here's a breakdown of the notebook's steps:

1.  **Model Loading:** The HeAR model is loaded from the Hugging Face Hub (requires authentication with your Hugging Face account).

2.  **Dataset Creation:**
    *   **Wikimedia Commons Audio:** A small set of audio files is downloaded from Wikimedia Commons. This dataset includes examples of coughing, as well as other sounds like sneezing, breathing, laughter, and door knocking. The files are all publicly available under various Creative Commons licenses (details are available on Wikimedia Commons).
    *   **Microphone Recording:** The notebook provides functionality to record audio directly within Colab using your microphone. This allows you to add your own recordings to the dataset.

3.  **Embedding Generation:**
    *   **Preprocessing:** The downloaded and recorded audio files are loaded and processed using `librosa`. They are resampled to 16 kHz (required by the HeAR model) and segmented into 2-second clips.
    *   **Inference:** The preprocessed 2-second audio clips are fed to the HeAR model to generate embeddings. Each clip produces a 512-dimensional HeAR embedding vector.
    *   **Visualization (Optional):** The notebook includes functions to display the audio waveform, Mel spectrogram, and an audio player for each file and its individual clips.

4.  **Classifier Training:**
    *   **Labeling:** A set of labels is manually created, associating each audio file with whether it contains a cough or not. For example, `Cough_1.ogg` is labeled as `True`, while `Laughter.ogg` is labeled as `False`.
    *   **Model Selection:** Several scikit-learn classifiers are trained, and the set is easy to extend:
        *   Support Vector Machine (linear kernel)
        *   Logistic Regression
        *   Gradient Boosting
        *   Random Forest
        *   Multi-layer Perceptron (MLP)
    *   **Training:** Each classifier is trained using the generated HeAR embeddings and the corresponding cough labels. This demonstrates the data efficiency of using embeddings: these models train quickly with very little data.

5.  **Cough Classification:**
    *   **Test on New Example:** The classifiers are tested on a held-out cough or non-cough sound example.
    *   **Test on New Recording:** The microphone recording function is used again to capture a new audio clip (presumably of the user coughing or not coughing).
    *   **Prediction:** The new clip is preprocessed, its embedding is generated using the HeAR model, and then each of the trained classifiers is used to predict whether the clip contains a cough.

6.  **Embedding Visualization:**
    *   **PCA Plot:** A plot visualizing the data points in a PCA space is presented to show how similar sounds are grouped together, as they have similar embeddings.
    *   **Barcode Visualization:** The embeddings are visualized as "barcodes". Each embedding is displayed as a row in a heatmap showing, for each dimension, the squared deviation from the mean training embedding. This highlights how each embedding differs from the average.
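
The segmentation described in step 3 can be sketched with NumPy alone. This is a minimal illustration rather than a definitive implementation: `segment_into_clips` is a hypothetical helper name, and the 10% overlap is an assumed setting. The 16 kHz sample rate and 2-second clip length come from the steps above.

```python
import numpy as np

SAMPLE_RATE = 16000                # Hz, the rate HeAR expects
CLIP_LENGTH = 2 * SAMPLE_RATE      # samples in one 2-second clip


def segment_into_clips(audio, clip_length=CLIP_LENGTH, overlap_percent=10):
    """Split a 1-D signal into fixed-length, overlapping clips (hypothetical helper).

    Audio shorter than one clip is zero-padded. Trailing samples that do not
    fill a complete clip are dropped.
    """
    overlap = int(clip_length * overlap_percent / 100)
    step = clip_length - overlap
    num_clips = max(1, (len(audio) - overlap) // step)
    clips = []
    for i in range(num_clips):
        clip = audio[i * step : i * step + clip_length]
        if len(clip) < clip_length:
            clip = np.pad(clip, (0, clip_length - len(clip)), "constant")
        clips.append(clip)
    return np.stack(clips)


five_seconds = np.random.default_rng(0).normal(size=5 * SAMPLE_RATE)
clips = segment_into_clips(five_seconds)
print(clips.shape)  # (2, 32000)
```

With a 10% overlap, consecutive clips start 1.8 seconds apart, so a 5-second file yields two full clips and the final 1.2 seconds are not covered by a clip of their own.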


## Authenticate with Hugging Face (skip if you have an HF_TOKEN secret)

In [None]:
from huggingface_hub.utils import HfFolder

if HfFolder.get_token() is None:
    from huggingface_hub import notebook_login
    notebook_login()

## Setup HeAR Hugging Face Model

In [None]:
from huggingface_hub import from_pretrained_keras

# Load the model directly from Hugging Face Hub
loaded_model = from_pretrained_keras("google/hear")
# Inference function for embedding generation
infer = loaded_model.signatures["serving_default"]

# HeAR Parameters
SAMPLE_RATE = 16000  # Samples per second (Hz)
CLIP_DURATION = 2    # Duration of the audio clip in seconds
CLIP_LENGTH = SAMPLE_RATE * CLIP_DURATION  # Total number of samples


In [None]:
# @title Test Model Inference on Random Input
%%time
import numpy as np

# Generate Random Input Audio
NUM_EXAMPLES = 4  # number of random audio examples to generate
print(f"Generating {NUM_EXAMPLES} {CLIP_DURATION}s raw audio examples.")
raw_audio = np.random.normal(size=(NUM_EXAMPLES, CLIP_LENGTH))
print(f"Raw audio shape: {raw_audio.shape}, data type: {raw_audio.dtype}\n")

# Perform inference, then extract the embeddings from the model output
print(f'Running HeAR model to produce {NUM_EXAMPLES} embeddings.')
output_dict = infer(x=raw_audio)
embedding = output_dict['output_0'].numpy()  # directly unpack as a NumPy array
print(f"Embedding shape: {embedding.shape}, data type: {embedding.dtype}")


## Download and Record Audio Data

Audio examples are taken from Wikimedia Commons: https://commons.wikimedia.org/wiki/Category:Coughing_audio


In [None]:
# @title Download Wikimedia Commons Audio Examples to Notebook
import os
import subprocess
from urllib.parse import urlparse

# More examples: https://commons.wikimedia.org/wiki/Category:Coughing_audio
wiki_cough_file_urls = [
  'https://upload.wikimedia.org/wikipedia/commons/c/cc/Man_coughing.ogg',
  'https://upload.wikimedia.org/wikipedia/commons/6/6a/Cough_1.ogg',
  'https://upload.wikimedia.org/wikipedia/commons/d/d9/Cough_2.ogg', # hold out for test
  'https://upload.wikimedia.org/wikipedia/commons/b/be/Woman_coughing_three_times.wav',
  'https://upload.wikimedia.org/wikipedia/commons/d/d0/Sneezing.ogg',
  'https://upload.wikimedia.org/wikipedia/commons/b/bc/Windy_breath.ogg',
  'https://upload.wikimedia.org/wikipedia/commons/e/ef/Laughter_and_clearing_voice.ogg',
  'https://upload.wikimedia.org/wikipedia/commons/c/c6/Laughter.ogg',
  'https://upload.wikimedia.org/wikipedia/commons/1/1c/Knocking_on_wood_or_door.ogg',
]

# Download the files.
files_map = {}  # file name to source URL map
file_embeddings = {} # embedding cache
for url in wiki_cough_file_urls:
    filename = os.path.basename(urlparse(url).path)
    print(f'Downloading {filename}...')
    res = subprocess.run(['wget', '-nv', '-O', filename, url], capture_output=True, text=True)
    if res.returncode != 0:
        print(f"  Download failed. Return code: {res.returncode}\nError: {res.stderr}")
        continue  # do not register a file that failed to download
    files_map[filename] = url
print(f'\nLocal Files:\n{os.listdir()}\n')

In [None]:
# @title Microphone Helpers
from io import BytesIO
from base64 import b64decode
from google.colab import output
from IPython.display import Javascript

RECORD_JAVASCRIPT = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""
def record_microphone_and_save(duration_seconds=2, filename="output_audio", extension='.webm'):
  output_filename = filename + extension
  print(f"\nRecording for {duration_seconds} seconds...")
  display(Javascript(RECORD_JAVASCRIPT))
  base64_audio = output.eval_js('record(%d)' % (duration_seconds * 1000))
  print("Done Recording!")
  audio_bytes = b64decode(base64_audio.split(',')[1])

  # Save the audio to a file
  with open(output_filename, 'wb') as file:
      file.write(audio_bytes)
  print(f"Audio saved as {output_filename}")
  return output_filename

In [None]:
# @title Record your own file

recording_name = "my_recording" # will overwrite existing
recording_file = record_microphone_and_save(duration_seconds=CLIP_DURATION, filename=recording_name)
files_map[recording_file] = recording_name # add to file map from above


## Model Inference

In [None]:
# @title Plot Helpers
import os
import librosa
import matplotlib.pyplot as plt
import librosa.display
from IPython.display import Audio
import matplotlib.cm as cm
import warnings

# Suppress noisy audio-decoding warnings from soundfile and librosa
warnings.filterwarnings("ignore", category=UserWarning, module="soundfile")
warnings.filterwarnings("ignore", module="librosa")


def plot_waveform(sound, sr, title, figsize=(12, 4), color='blue', alpha=0.7):
  """Plots the waveform of the audio using librosa.display."""
  plt.figure(figsize=figsize)
  librosa.display.waveshow(sound, sr=sr, color=color, alpha=alpha)
  plt.title(f"{title}\nshape={sound.shape}, sr={sr}, dtype={sound.dtype}")
  plt.xlabel("Time (s)")
  plt.ylabel("Amplitude")
  plt.grid(True)
  plt.tight_layout()
  plt.show()


def plot_spectrogram(sound, sr, title, figsize=(12, 4), n_fft=2048, hop_length=256, n_mels=128, cmap='nipy_spectral'):
  """Plots the Mel spectrogram of the audio using librosa."""
  plt.figure(figsize=figsize)
  mel_spectrogram = librosa.feature.melspectrogram(y=sound, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
  log_mel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)
  librosa.display.specshow(log_mel_spectrogram, sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel', cmap=cmap)
  plt.title(f"{title} - Mel Spectrogram")
  plt.tight_layout()
  plt.show()


In [None]:
# @title Load Audio and Generate HeAR Embeddings
%%time

# Audio display options
SHOW_WAVEFORM = False
SHOW_SPECTROGRAM = True
SHOW_PLAYER = True
SHOW_CLIPS = False

# Clips of length CLIP_DURATION seconds are extracted from the audio file
# using a sliding window. Adjacent clips overlap by CLIP_OVERLAP_PERCENT.
CLIP_OVERLAP_PERCENT = 10

# When True, a clip extracted from the file that is quieter than
# SILENCE_RMS_THRESHOLD_DB is not sent to the HeAR model.
CLIP_IGNORE_SILENT_CLIPS = True
# Clips whose RMS loudness (in dB) falls below this value are treated as silent.
SILENCE_RMS_THRESHOLD_DB = -50


for file_key, file_url in files_map.items():
  # Load the audio file into numpy array with specified sample rate and 1 channel (mono).
  print(f"\nLoading file: {file_key} from {file_url}")
  audio, sample_rate = librosa.load(file_key, sr=SAMPLE_RATE, mono=True)

  # Display audio file (optional)
  if SHOW_WAVEFORM:
    plot_waveform(audio, sample_rate, title=file_key, color='blue')
  if SHOW_SPECTROGRAM:
    plot_spectrogram(audio, sample_rate, file_key,  n_fft=2*1024, hop_length=64, n_mels=256, cmap='Blues')
  if SHOW_PLAYER:
    display(Audio(data=audio, rate=sample_rate))

  # Segment the audio array into overlapping clips of CLIP_LENGTH samples.
  # Trailing audio that does not fill a complete clip is dropped; a file
  # shorter than one clip yields a single clip padded with zeros.
  clip_batch = []
  overlap_samples = int(CLIP_LENGTH * (CLIP_OVERLAP_PERCENT / 100))
  step_size = CLIP_LENGTH - overlap_samples
  num_clips = max(1, (len(audio) - overlap_samples) // step_size)
  print(f" Segmenting into {num_clips} {CLIP_DURATION}s clips")
  for i in range(num_clips):
    start_sample = i * step_size
    end_sample = start_sample + CLIP_LENGTH
    clip = audio[start_sample:end_sample]
    # Pad clip with zeros if less than the required CLIP_LENGTH.
    if end_sample > len(audio):
        print("  Last clip: Padding with zeros.")
        clip = np.pad(clip, (0, CLIP_LENGTH - len(clip)), 'constant')
    # Average (RMS) loudness of the clip in dB.
    # The small floor avoids log10(0) when a clip is entirely zeros.
    rms_loudness = round(20 * np.log10(max(np.sqrt(np.mean(clip**2)), 1e-10)))

    # Display clip info (optional)
    clip_str = f"Clip {i+1} from {file_key} [loudness: {rms_loudness} dB]"
    print(f"  {clip_str}")
    if SHOW_CLIPS:
      if SHOW_WAVEFORM:
        plot_waveform(clip, sample_rate, title=clip_str, figsize=(8, 3), color=cm.rainbow(i /num_clips))
      if SHOW_PLAYER:
        display(Audio(data=clip, rate=sample_rate))

    # Skip if clip is too quiet
    if CLIP_IGNORE_SILENT_CLIPS and rms_loudness < SILENCE_RMS_THRESHOLD_DB:
      print(f"  Clip {i+1} Skip...too quiet [loudness: {rms_loudness} dB]")
      continue

    # Add clip to batch
    clip_batch.append(clip)


  # Perform HeAR batch inference to extract the embedding for each clip.
  # Only run inference if the file is not already in the file_embeddings cache.
  if not clip_batch:
    print("  No clips left after silence filtering, skipping file.")
    continue
  clip_batch = np.asarray(clip_batch)
  if file_key not in file_embeddings:
    print("  File not in cache, performing inference...")
    embedding_batch = infer(x=clip_batch)['output_0'].numpy()
    file_embeddings[file_key] = embedding_batch
  else:
    embedding_batch = file_embeddings[file_key]
  print(f"  Embedding batch shape: {embedding_batch.shape}, data type: {embedding_batch.dtype}")
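
The silence check above compares a clip's RMS level, expressed in decibels, against a threshold. As a small worked example, the sketch below computes that level for a synthetic tone. `rms_db` is a hypothetical helper and the `1e-10` floor is an assumed value chosen only to keep the logarithm finite for all-zero input.

```python
import numpy as np


def rms_db(clip, floor=1e-10):
    """RMS level in dB relative to an amplitude of 1.0 (hypothetical helper).

    The floor (an assumed value) keeps log10 finite for an all-zero clip.
    """
    rms = np.sqrt(np.mean(np.square(clip)))
    return 20 * np.log10(max(rms, floor))


t = np.arange(32000) / 16000             # 2 seconds at 16 kHz
sine = np.sin(2 * np.pi * 440 * t)       # full-scale 440 Hz tone

print(round(rms_db(sine), 2))            # about -3.01 dB
print(rms_db(0.001 * sine) < -50)        # True: 60 dB quieter, below a -50 dB threshold
print(np.isfinite(rms_db(np.zeros(32000))))  # True: the floor avoids -inf
```

A full-scale sine has an RMS of about 0.707, which is roughly -3 dB, so a -50 dB threshold only rejects clips that are very quiet relative to full scale.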


In [None]:
# @title Collect Embeddings, Create Training Set

# Hold out example for testing with cough classifier later.
# Only using one test file for now, can also record your own.
# Note: `Cough_2.ogg` is likely the same person as `Cough_1.ogg`,
#  so we expect it to produce a similar embedding
test_file = 'Cough_2.ogg'
assert test_file in files_map, f"Test file '{test_file}' not found in files_map."


# Combine train embeddings and hold out test embeddings.
test_embeddings, test_file_names = [], [] # held out
train_embeddings, train_file_names = [], []
for file_key, embedding_batch in file_embeddings.items():
  for embedding in embedding_batch:
    if file_key == test_file:
      test_embeddings.append(embedding)
      test_file_names.append(file_key)
    else:
      train_embeddings.append(embedding)
      train_file_names.append(file_key)
train_embeddings = np.array(train_embeddings)
train_file_set = set(train_file_names)
test_file_set = set(test_file_names)

print(f"Train embeddings have shape: {train_embeddings.shape}, data type: {train_embeddings.dtype}")
print(f"Train Embeddings are from {len(train_file_set)} unique files:{train_file_set}")
print(f"Test Embeddings are from {len(test_file_set)} unique files:{test_file_set}")
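
One common way to quantify whether two embeddings are "similar" is cosine similarity. The sketch below is not part of the notebook's pipeline: it uses random 512-dimensional stand-in vectors instead of real HeAR embeddings, and `cosine_similarity` is a hypothetical helper name.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
base = rng.normal(size=512)                   # stand-in for one embedding
near = base + 0.1 * rng.normal(size=512)      # small perturbation of it
other = rng.normal(size=512)                  # unrelated random vector

print(cosine_similarity(base, near) > cosine_similarity(base, other))  # True
```

With real data, the same function could be applied to rows of `file_embeddings['Cough_1.ogg']` and `file_embeddings['Cough_2.ogg']` to check how close the two recordings are.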



## Use Embeddings

In [None]:
# @title Plot Train Embeddings, show average Embedding per file
from sklearn.decomposition import PCA

# Fit PCA
pca = PCA(n_components=2)
pca_embeddings = pca.fit_transform(train_embeddings)

# Calculate average embedding per file after PCA, mark as star
avg_embeddings_per_file_pca = {}
for file_key in train_file_set:
    file_indices = [i for i, key in enumerate(train_file_names) if key == file_key]
    avg_embeddings_per_file_pca[file_key] = np.mean(pca_embeddings[file_indices], axis=0)

# Plot with coloring and average embedding
plt.figure(figsize=(10, 10))
colors = cm.rainbow(np.linspace(0, 1, len(train_file_set)))
color_map = {key: colors[i] for i, key in enumerate(train_file_set)}
for i, embedding in enumerate(pca_embeddings):
    file_key = train_file_names[i]
    plt.scatter(embedding[0], embedding[1], color=color_map[file_key], alpha=0.5)  # No label for each dot

# Add average embeddings as star markers (using PCA averages)
for file_key, avg_embedding in avg_embeddings_per_file_pca.items():
    plt.scatter(avg_embedding[0], avg_embedding[1], marker='*', color=color_map[file_key], label=file_key, s=400)  # Label for average

plt.xlabel("PCA Dimension 1")
plt.ylabel("PCA Dimension 2")
plt.title("Embeddings with PCA, File Coloring, and PCA Average Embedding Markers")
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.grid(True)
plt.show()
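
To make the PCA step less of a black box, here is a minimal NumPy sketch of the same idea: center the data, take an SVD, and project onto the top directions. It uses synthetic clusters in place of HeAR embeddings, `pca_project` is a hypothetical helper, and the resulting axes may differ in sign from scikit-learn's `PCA`.

```python
import numpy as np


def pca_project(embeddings, n_components=2):
    """Project rows onto the top principal components via SVD (hypothetical helper)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]


rng = np.random.default_rng(0)
# Two synthetic clusters of 512-dimensional stand-in embeddings.
cluster_a = rng.normal(loc=0.0, size=(20, 512))
cluster_b = rng.normal(loc=1.0, size=(20, 512))
points, explained = pca_project(np.vstack([cluster_a, cluster_b]))

print(points.shape)  # (40, 2)
print(explained)     # fraction of variance captured by each of the two axes
```

Because the two clusters differ in every dimension, the first principal axis lines up with the direction separating them, which is why similar sounds tend to land close together in the plot.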

In [None]:
# @title Train a few-shot cough classifier with HeAR embeddings
%%time
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.gaussian_process import GaussianProcessClassifier

# True if file has coughing (not perfect as some parts of file may not have coughing)
file_cough_labels = {
    'Laughter.ogg': False,
    'Cough_1.ogg': True,
    'Windy_breath.ogg': False,
    'Man_coughing.ogg': True,
    'Cough_2.ogg': True,
    'Woman_coughing_three_times.wav': True,
    'Short_coughs.ogg': True,
    'Laughter_and_clearing_voice.ogg': False,
    'Sneezing.ogg': False,
    'Knocking_on_wood_or_door.ogg': False,
    'recording.webm': True,
    'my_recording.webm': True,

}
cough_labels = []
for file_name in train_file_names:
  if file_name in file_cough_labels:
    cough_labels.append(1 if file_cough_labels[file_name] else 0)
  elif "cough" in file_name.lower():
    cough_labels.append(1)
  else:
    cough_labels.append(0)
    print(f"Warning: No label found for '{file_name}'. Defaulting to False.")

# Classifiers to train on the embeddings (add entries to this dict to try others)
models = {
    "Support Vector Machine (linear)": SVC(kernel='linear'),
    "Logistic Regression": LogisticRegression(),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=128),
    "Random Forest": RandomForestClassifier(n_estimators=128),
    "MLP Classifier": MLPClassifier(hidden_layer_sizes=(128, 64)),
}

cough_models = {}
for name, model in models.items():
  model.fit(train_embeddings, cough_labels)
  cough_models[name] = model
  print(f"Finished training: {name}")


In [None]:
# @title Classify Held Out Test Example

print(f"Classifying {len(test_embeddings)} embeddings from {test_file} with the {len(cough_models)} models...")
for model_name, cough_model in cough_models.items():
  # Note: Since the file is divided into CLIP_DURATION length clips, some
  # clips will contain the cough while others won't. Since we want to know if
  # ANY clip from this test file contains a cough, we can check max(predictions).
  # If we want to know where in the file the coughs occur, we can look at
  # which clip indices were classified as 1 (cough).
  prediction = cough_model.predict(test_embeddings).max()
  print(f" {model_name} Classification: {'Cough' if prediction == 1 else 'No Cough'}")
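
The per-clip predictions can also be mapped back to approximate time ranges in the file. The sketch below assumes the 2-second clips and 10% overlap used earlier, and uses a hard-coded `clip_predictions` array as a stand-in for the output of `cough_model.predict(test_embeddings)`.

```python
import numpy as np

CLIP_DURATION = 2.0          # seconds per clip
CLIP_OVERLAP_PERCENT = 10    # overlap between adjacent clips
STEP_SECONDS = CLIP_DURATION * (1 - CLIP_OVERLAP_PERCENT / 100)  # 1.8 s

# Stand-in for cough_model.predict(test_embeddings): one 0/1 label per clip.
clip_predictions = np.array([0, 1, 1, 0, 0])

# File-level decision: a cough is present if any clip is classified as 1.
file_has_cough = bool(clip_predictions.max() == 1)

# Clip-level localization: indices of clips classified as cough.
cough_clips = np.flatnonzero(clip_predictions == 1)
for i in cough_clips:
    start = i * STEP_SECONDS
    print(f"Clip {i + 1}: cough between {start:.1f}s and {start + CLIP_DURATION:.1f}s")
```

This index-to-time mapping only holds when no clips were dropped as silent; if some were skipped, the original clip indices would need to be stored alongside the embeddings.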

In [None]:
# @title Record and Classify Cough
recording_filename = "my_test_cough" # will overwrite existing
recording_file = record_microphone_and_save(duration_seconds=CLIP_DURATION, filename=recording_filename)
recording_clip = librosa.load(recording_file, sr=SAMPLE_RATE)[0]
print(f"Loaded Test file {recording_filename}, audio has shape: {recording_clip.shape}")

print(f"Generate HeAR embedding for {recording_filename}")
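# Note: The recording is nominally CLIP_DURATION seconds, but the browser may
# return slightly more or fewer samples. Truncate or zero-pad to exactly
# CLIP_LENGTH so we get a single clip and a single embedding.
recording_clip = recording_clip[:CLIP_LENGTH]
recording_batch = np.expand_dims(np.pad(recording_clip, (0, CLIP_LENGTH - len(recording_clip)), 'constant'), axis=0)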
recording_embedding = infer(x=recording_batch)['output_0'].numpy()
print(f"Embedding has shape: {recording_embedding.shape}")

# Classify recorded file with each classifier.
print(f"\nClassifying test file: {recording_filename} using {len(cough_models)} models...")
for model_name, cough_model in cough_models.items():
  # Note: Similar to the above held out example, we will have a prediction for
  # each clip within the file, in this case we have one clip from the recording.
  prediction = cough_model.predict(recording_embedding).max() # or [0]
  print(f" {model_name} Classification: {'Cough' if prediction == 1 else 'No Cough'}")

# Player for recorded clip
Audio(data=recording_clip, rate=SAMPLE_RATE)


In [None]:
# @title Plot Embeddings as Barcode Figures

# Note: We subtract the mean embedding so plots highlight the differences.
embedding_mean = np.mean(train_embeddings, axis=0)
for file_key, embedding_batch in file_embeddings.items():
  batch_size = embedding_batch.shape[0]
  embedding_batch_norm = embedding_batch - embedding_mean
  print(f"{file_key} has {batch_size} embeddings...")

  plt.figure(figsize=(18, 1 * embedding_batch.shape[0]))
  for i in range(batch_size):
    embedding_magnitude = embedding_batch_norm[i, :] ** 2
    plt.subplot(batch_size, 1, i + 1)
    plt.imshow(embedding_magnitude.reshape(1, -1), cmap='binary',  interpolation=None, aspect='auto')
    plt.title(f"Embedding {i+1}, File: {file_key}")
    plt.xticks([])
    plt.yticks([])
  plt.tight_layout()
  plt.show()

## Next steps

Explore the other [notebooks](https://github.com/google-health/hear/blob/master/notebooks) to learn what else you can do with the model.