~~~markdown
Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
~~~

# HeAR Event Detector Demo
<table><tbody><tr>
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/google-health/hear/blob/master/notebooks/hear_event_detector_demo.ipynb">
      <img alt="Google Colab logo" src="https://www.tensorflow.org/images/colab_logo_32px.png" width="32px"><br> Run in Google Colab
    </a>
  </td>  
  <td style="text-align: center">
    <a href="https://github.com/google-health/hear/blob/master/notebooks/hear_event_detector_demo.ipynb">
      <img alt="GitHub logo" src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" width="32px"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://huggingface.co/google/hear">
      <img alt="HuggingFace logo" src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" width="32px"><br> View on HuggingFace
    </a>
  </td>
</tr></tbody></table>


This Colab notebook demonstrates how to use the HeAR (Health Acoustic Representations) model together with the included health event detectors, loaded directly from Hugging Face, to identify audio clips containing relevant health sounds such as coughing, breathing, or sneezing, and then to generate and use HeAR embeddings for that subset of clips.

This notebook is similar to `train_data_efficient_classifier.ipynb` and also uses the small [Wikimedia Commons](https://commons.wikimedia.org/wiki/Commons:Welcome) dataset of relevant health sounds. In this example the audio files are reduced to a smaller subset of clips using the event detector to identify clips containing interesting health sounds to embed with HeAR.



#### This notebook demonstrates:

1.  Loading all supported Hugging Face Models (HeAR, Event Detector and Frontend).

2.  Detecting 2-second clips within the [Wikimedia Commons](https://commons.wikimedia.org/wiki/Commons:Welcome) dataset that have a high probability of containing one or more of the supported event detection labels, then generating HeAR embeddings for these clips.

3.  Finding the most similar audio files to a given query audio file based on the Cosine Similarity between the respective HeAR embeddings of the audio files.

4. Optimizing the event detectors and frontend for low-latency, on-device use with TFLite to support large-scale event detection or feature generation.

# Authenticate with HuggingFace, skip if you have a HF_TOKEN secret

In [None]:
from huggingface_hub.utils import HfFolder

if HfFolder.get_token() is None:
    from huggingface_hub import notebook_login
    notebook_login()

# Clone HuggingFace repository snapshot

This will store the HeAR and event detector models in local cache so they can be loaded later.



In [None]:
import numpy as np
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Keras version:", tf.keras.__version__)

from huggingface_hub import snapshot_download
hugging_face_repo = "google/hear"
local_snapshot_path = snapshot_download(repo_id=hugging_face_repo)
print(f"Saved {hugging_face_repo} to {local_snapshot_path}\n")


# Download Audio Data

The examples below come from the Wikimedia Commons category [Coughing audio](https://commons.wikimedia.org/wiki/Category:Coughing_audio).


In [None]:
# @title Download Public Domain Cough Examples to Notebook
import os
import subprocess
from urllib.parse import urlparse

# More examples: https://commons.wikimedia.org/wiki/Category:Coughing_audio
wiki_cough_file_urls = [
  'https://upload.wikimedia.org/wikipedia/commons/c/cc/Man_coughing.ogg',
  'https://upload.wikimedia.org/wikipedia/commons/6/6a/Cough_1.ogg',
  'https://upload.wikimedia.org/wikipedia/commons/d/d9/Cough_2.ogg',
  'https://upload.wikimedia.org/wikipedia/commons/b/be/Woman_coughing_three_times.wav',
  'https://upload.wikimedia.org/wikipedia/commons/d/d0/Sneezing.ogg',
  'https://upload.wikimedia.org/wikipedia/commons/e/ef/Laughter_and_clearing_voice.ogg',
  'https://upload.wikimedia.org/wikipedia/commons/c/c6/Laughter.ogg',
  'https://upload.wikimedia.org/wikipedia/commons/1/1c/Knocking_on_wood_or_door.ogg',
]

# Download the files.
files_map = {}  # maps local file name to its source URL
for url in wiki_cough_file_urls:
  filename = os.path.basename(urlparse(url).path)
  print(f'Downloading {filename}...')
  res = subprocess.run(['wget', '-nv', '-O', filename, url], capture_output=True, text=True)
  if res.returncode != 0:
    print(f"  Download failed. Return code: {res.returncode}\nError: {res.stderr}")
    continue  # skip failed downloads so later cells do not try to load them
  files_map[filename] = url
print(f'\nLocal Files:\n{os.listdir()}\n')

# Load Models and Run Inference

### HeAR Model

The HeAR model uses a [ViT](https://huggingface.co/docs/transformers/en/model_doc/vit) backbone and generates 512-dimensional embeddings from a 2-second, single-channel, 16 kHz audio clip. See the [HuggingFace Model Card](https://huggingface.co/google/hear) for more details.
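
As a minimal sketch of the input format described above, the snippet below zero-pads or truncates mono audio to exactly 2 seconds at 16 kHz (32,000 samples) and adds a batch dimension. The helper name `to_fixed_length_clip` and the synthetic tone are illustrative assumptions, not part of the HeAR release.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz, as stated above
CLIP_DURATION = 2    # seconds, as stated above
CLIP_SAMPLES = SAMPLE_RATE * CLIP_DURATION  # 32,000 samples per clip


def to_fixed_length_clip(audio: np.ndarray) -> np.ndarray:
    """Zero-pads or truncates mono audio to exactly CLIP_SAMPLES samples."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim != 1:
        raise ValueError("expected mono audio with shape (num_samples,)")
    if len(audio) < CLIP_SAMPLES:
        audio = np.pad(audio, (0, CLIP_SAMPLES - len(audio)), mode="constant")
    return audio[:CLIP_SAMPLES]


# Hypothetical input: 1.5 s of a 440 Hz tone standing in for a recording.
t = np.arange(int(1.5 * SAMPLE_RATE)) / SAMPLE_RATE
short_audio = 0.1 * np.sin(2 * np.pi * 440 * t)

clip = to_fixed_length_clip(short_audio)
batch = clip[np.newaxis, :]  # add a leading batch dimension
print(batch.shape, batch.dtype)
```

Recordings longer than one clip are usually split into overlapping windows rather than truncated, as the inference cell later in this notebook does.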

### Event Detector Models

The event detector models use the efficient [MobileNet-V3](https://huggingface.co/docs/timm/en/models/mobilenet-v3) backbone paired with our custom TensorFlow spectrogram frontend.

As with HeAR, these models expect a 2-second, single-channel, 16 kHz audio clip. They output 8 detection probability scores, one for each of the following labels, in this order:

```
['Cough', 'Snore', 'Baby Cough', 'Breathe', 'Sneeze', 'Throat Clear', 'Laugh', 'Speech']
```
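
To make the label ordering concrete, here is a small sketch that maps one clip's 8 scores back to label names and keeps those above a threshold. The scores, the 0.5 threshold, and the helper `labels_above_threshold` are made up for illustration; real scores come from the event detector.

```python
import numpy as np

LABEL_LIST = ['Cough', 'Snore', 'Baby Cough', 'Breathe', 'Sneeze',
              'Throat Clear', 'Laugh', 'Speech']


def labels_above_threshold(scores, threshold=0.5):
    """Returns {label: score} for every label scoring above threshold."""
    scores = np.asarray(scores, dtype=np.float32)
    if scores.shape != (len(LABEL_LIST),):
        raise ValueError(f"expected {len(LABEL_LIST)} scores, got shape {scores.shape}")
    return {
        label: float(score)
        for label, score in zip(LABEL_LIST, scores)
        if score > threshold
    }


# Made-up scores for a single clip, for illustration only.
example_scores = [0.97, 0.01, 0.40, 0.05, 0.02, 0.65, 0.03, 0.10]
detected = labels_above_threshold(example_scores, threshold=0.5)
print(sorted(detected))  # ['Cough', 'Throat Clear']
```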

The event detector has two size variants which can be used interchangeably depending on the use-case.

*  `event_detector_small/` is based on [MobileNetV3Small](https://www.tensorflow.org/api_docs/python/tf/keras/applications/MobileNetV3Small) with approximately **1M** parameters (3.60 MB)

*  `event_detector_large/` is based on [MobileNetV3Large](https://www.tensorflow.org/api_docs/python/tf/keras/applications/MobileNetV3Large) with approximately **3M** parameters (11.46 MB)
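
As a rough sanity check on the sizes listed above, the sketch below converts each listed size into an implied number of 32-bit values. It assumes the weights are stored as float32, that the sizes are in binary megabytes, and that weights account for nearly all of the size. None of these assumptions is confirmed by the model card.

```python
BYTES_PER_FLOAT32 = 4
BYTES_PER_MB = 2 ** 20  # assumption: the listed sizes use binary megabytes

# Sizes as listed above.
listed_sizes_mb = {"event_detector_small": 3.60, "event_detector_large": 11.46}

# Implied number of float32 values if the whole size were weights.
implied_params = {
    name: size_mb * BYTES_PER_MB / BYTES_PER_FLOAT32
    for name, size_mb in listed_sizes_mb.items()
}
for name, count in implied_params.items():
    print(f"{name}: ~{count / 1e6:.2f}M float32 values")
```

Under these assumptions the results come out near 0.94M and 3.0M, in line with the approximate parameter counts quoted above.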

#### Spectrogram Frontend

Our event detectors are fused with a custom, on-device optimized spectrogram frontend which efficiently converts 2 seconds of 16kHz audio into [PCEN](https://research.google/pubs/trainable-frontend-for-robust-and-far-field-keyword-spotting) scaled [Mel-spectrogram](https://huggingface.co/learn/audio-course/en/chapter1/audio_data#mel-spectrogram) features with 200 time steps and 48 Mel-frequency bins.

*  `spectrogram_frontend/` is based on the PCEN implementation from [LEAF](https://research.google/blog/leaf-a-learnable-frontend-for-audio-classification/) and has only **5k** non-trainable parameters (18.56 KB) which are frozen and not configurable.

We provide a standalone version of this frontend so that features can be pre-computed for new event detector training applications. See the **Extract Batch Frontend Spectrogram Features** section for usage examples.
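
The feature shape quoted above implies a simple time resolution, worked through in the sketch below. It assumes the 200 time steps are spread uniformly over the 2-second clip. The frontend's actual window and hop settings are not stated here, so treat the per-step figures as an estimate.

```python
SAMPLE_RATE = 16000  # Hz
CLIP_DURATION = 2    # seconds
TIME_STEPS = 200     # from the feature shape (200, 48)
MEL_BINS = 48

clip_samples = SAMPLE_RATE * CLIP_DURATION             # samples per input clip
samples_per_step = clip_samples // TIME_STEPS          # audio samples per time step
ms_per_step = 1000 * samples_per_step / SAMPLE_RATE    # milliseconds per time step
feature_values = TIME_STEPS * MEL_BINS                 # values per feature

print(f"{clip_samples} samples -> {TIME_STEPS} steps of {samples_per_step} samples "
      f"({ms_per_step:.0f} ms each), {feature_values} values per feature")
```

Each feature therefore holds fewer values than the raw clip it summarizes (9,600 versus 32,000), which helps when pre-computing features for large datasets.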


In [None]:
# @title Load HeAR, Event Detector and Frontend Models
from huggingface_hub import from_pretrained_keras

# Constants for all included models, each input should be 32,000 samples.
SAMPLE_RATE = 16000
CLIP_DURATION = 2

# Select event detector variant
EVENT_DETECTOR = "event_detector_small" # @param ["event_detector_large", "event_detector_small"]
# Included event detectors are trained to output detection probabilities in this order.
LABEL_LIST =  ['Cough', 'Snore', 'Baby Cough', 'Breathe', 'Sneeze', 'Throat Clear', 'Laugh', 'Speech']

# HeAR Embedding Model
print(f"\nLoading HeAR model")
hear_model = from_pretrained_keras(local_snapshot_path)
hear_infer = hear_model.signatures["serving_default"]

# Event detector models and frontend are nested in the "event_detector/" folder.
# Detector frontend model for efficiently computing spectrogram features.
frontend_path = os.path.join("event_detector/", "spectrogram_frontend")
print(f"\nLoading frontend model from: {frontend_path}")
frontend_model = from_pretrained_keras(
    os.path.join(local_snapshot_path, frontend_path)
)

# Detector Model based on size variant selection, frontend included
event_detector_path = os.path.join("event_detector/", EVENT_DETECTOR)
print(f"\nLoading detector model from: {event_detector_path}")
event_detector = from_pretrained_keras(
    os.path.join(local_snapshot_path, event_detector_path)
)


In [None]:
# @title Plot Helpers
import librosa
import matplotlib.pyplot as plt
import librosa.display
from IPython.display import Audio
import matplotlib.cm as cm
import warnings

# Suppress the specific warning
warnings.filterwarnings("ignore", category=UserWarning, module="soundfile")
warnings.filterwarnings("ignore", module="librosa")

def plot_waveform(sound, sr, title, figsize=(12, 4), color='blue', alpha=0.7):
  """Plots the waveform of the audio using librosa.display."""
  plt.figure(figsize=figsize)
  librosa.display.waveshow(sound, sr=sr, color=color, alpha=alpha)
  plt.title(f"{title}\nshape={sound.shape}, sr={sr}, dtype={sound.dtype}")
  plt.xlabel("Time (s)")
  plt.ylabel("Amplitude")
  plt.grid(True)
  plt.tight_layout()
  plt.show()


def plot_spectrogram(sound, sr, title, figsize=(12, 4), n_fft=2048, hop_length=256, n_mels=128, cmap='nipy_spectral'):
  """Plots the Mel spectrogram of the audio using librosa."""
  plt.figure(figsize=figsize)
  mel_spectrogram = librosa.feature.melspectrogram(y=sound, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
  log_mel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)
  librosa.display.specshow(log_mel_spectrogram, sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel', cmap=cmap)
  plt.title(f"{title} - Mel Spectrogram")
  plt.tight_layout()
  plt.show()

In [None]:
# @title Event Detector Plot Helpers

def plot_frontend_feature(
    frontend_feature: np.ndarray,
    title: str,
    figsize: tuple[int, int] = (12, 4),
    cmap: str = 'nipy_spectral',
) -> None:
  """Plots the frontend spectrogram input feature.

  Args:
    frontend_feature: The event detector frontend feature as a 2D NumPy array
      with shape (number of time steps, number of frequency bins). The default
      shape for the included event detectors is (200, 48), which represents a
      2-second audio clip with 48 frequency bins.
    title: The title prefix of the plot.
    figsize: Optional size of the figure.
    cmap: Optional colormap to use.
  """
  # Frontend features have shape (time steps, frequency bins). Rotate them so
  # that the x axis is time, the usual orientation for spectrogram plots.
  audio_spectrogram = np.rot90(frontend_feature)
  plt.figure(figsize=figsize)
  plt.imshow(audio_spectrogram, aspect='auto', cmap=cmap)
  plt.title(f"{title} - Frontend PCEN Mel Spectrogram")
  plt.tight_layout()
  plt.show()

def plot_detection_scores(
    scores_batch: np.ndarray,
    label_list: list[str],
    title: str,
    figsize: tuple[int, int] = (12, 4),
    cmap: str = 'nipy_spectral',
) -> None:
  """Plots per-label detection scores for batch sequentially with a consistent color scale.

  Args:
    scores_batch: The event detection scores as a 2D NumPy array with shape
      (number of clips, number of labels). Where number of labels should match
      the length of the label_list, which is 8 for the included event detectors.
    label_list: A list of labels representing the event detector classes.
    title: The title prefix of the plot.
    figsize: Optional size of the figure.
    cmap: Optional colormap to use.
  """
  plt.figure(figsize=figsize)
  scores_img = np.transpose(scores_batch)
  # Explicitly set the color limits for imshow
  im = plt.imshow(scores_img, aspect='auto', cmap=cmap, vmin=0, vmax=1)
  # Set up the 'y' label axis
  plt.yticks(
      np.arange(len(label_list)), [l.replace(' ', '\n') for l in label_list]
  )
  # Add horizontal grid lines between labels
  for i in range(1, scores_img.shape[0]):
    plt.axhline(y=i - 0.5, color='gray', linestyle='--')
  plt.grid(axis='y', which='major', color='white', alpha=0)
  # Setup the 'x' time axis
  n_clips = scores_img.shape[1]
  plt.xticks(np.arange(n_clips), [f'Clip {i+1}' for i in range(n_clips)])
  plt.xlabel("Time Step")
  # Add vertical grid lines between time steps
  for j in range(1, n_clips):
    plt.axvline(x=j - 0.5, color='gray', linestyle='--')
  plt.title(f"{title} - Sound Event Detections")
  # Add colorbar with a consistent scale from 0 to 1
  plt.colorbar(im, ticks=[0, 0.2, 0.4, 0.6, 0.8, 1.0])
  plt.tight_layout()
  plt.show()

In [None]:
# @title Load Audio and Generate HeAR Embeddings
%%time

# Audio display options
SHOW_WAVEFORM = False
SHOW_SPECTROGRAM = False
SHOW_PLAYER = True
SHOW_DETECTION_SCORES = True
COLORMAP = "Blues"

# Keep clips with high detection scores for these respiratory labels
# then embed the clips using HeAR. Ignore clips with no detections
LABELS_TO_EMBED = ['Cough', 'Snore', 'Breathe', 'Sneeze']
# Assert that all labels to embed are actually present in the main label list
assert all(label in LABEL_LIST for label in LABELS_TO_EMBED)

# Clips of length CLIP_DURATION seconds are extracted from the audio file
# using a sliding window. Adjacent clips overlap by CLIP_OVERLAP_PERCENT.
CLIP_OVERLAP_PERCENT = 10

# Labels must have score above this threshold to be considered a detection
DETECTION_THRESHOLD = 0.9

frame_length = int(CLIP_DURATION * SAMPLE_RATE)
frame_step = int(frame_length * (1 - CLIP_OVERLAP_PERCENT / 100))
hear_embeddings = {}
for file_key, file_url in files_map.items():
  hear_embeddings[file_key] = {}
  print(f"\nLoading file: {file_key} from {file_url}")
  audio, sample_rate = librosa.load(file_key, sr=SAMPLE_RATE, mono=True)

  # Display full audio file (optional).
  if SHOW_WAVEFORM:
    plot_waveform(audio, sample_rate, title=file_key, color='blue')
  if SHOW_SPECTROGRAM:
    plot_spectrogram(
      audio, sample_rate, title=file_key, n_fft=2*1024, hop_length=64, n_mels=256, cmap=COLORMAP)
  if SHOW_PLAYER:
    display(Audio(data=audio, rate=sample_rate))

  # Segment an audio array into fixed length overlapping clips.
  if len(audio) < frame_length:
    audio = np.pad(audio, (0, frame_length - len(audio)), mode='constant')
  audio_clip_batch = tf.signal.frame(audio, frame_length, frame_step )
  print(f"Number of audio clips in batch: {len(audio_clip_batch)}.")

  # Perform detector inference on the audio_clip_batch
  # The model will generate the input feature, then infer the detection
  print(f"Running batched {EVENT_DETECTOR} model inference on audio clips.")
  detection_scores_batch = event_detector(audio_clip_batch)["scores"].numpy()
  print("Computed batch probability scores with shape:",
        f"{detection_scores_batch.shape} from input audio clips batch with",
        f"shape: {audio_clip_batch.shape}"
  )
  hear_embeddings[file_key]['detections'] = detection_scores_batch

  if SHOW_DETECTION_SCORES:
    plot_detection_scores(detection_scores_batch, LABEL_LIST, title=f'{file_key}: {EVENT_DETECTOR}', cmap=COLORMAP)

  # Keep a clip for HeAR inference if ANY of its detection scores for the
  # 'LABELS_TO_EMBED' is above the 'DETECTION_THRESHOLD'.
  print(f"Filtering clips based on detections for labels: {LABELS_TO_EMBED}")
  embed_hear_clips = []
  for clip_i, scores in enumerate(detection_scores_batch):
    for label_index, label in enumerate(LABEL_LIST):
      if label in LABELS_TO_EMBED and scores[label_index] > DETECTION_THRESHOLD:
        embed_hear_clips.append(audio_clip_batch[clip_i])
        break

  # Perform HeAR batch inference to extract the associated clip embeddings.
  # Only run inference on 'embed_hear_clips', which have a high detection
  # score for at least one of the 'LABELS_TO_EMBED'.
  if len(embed_hear_clips):
    print(f"Computing HeAR embedding for batch of",
          f"{len(embed_hear_clips)} selected clips.")
    hear_embedding_batch = hear_infer(x=np.asarray(embed_hear_clips))['output_0'].numpy()
    print(f"Embedding batch shape: {hear_embedding_batch.shape},",
          f"data type: {hear_embedding_batch.dtype}")
  else:
    hear_embedding_batch = np.array([])
    print(f"None of the {len(audio_clip_batch)} clips in {file_key} have",
          f"detections above the threshold: {DETECTION_THRESHOLD} for the",
          f"labels: {LABELS_TO_EMBED}")
  hear_embeddings[file_key]['embeddings'] = hear_embedding_batch

In [None]:
# @title Use HeAR embeddings to find most similar file to query file
from scipy.spatial import distance

# Choose the query file and check that it exists in the dictionary and has embeddings.
query_file_key = 'Cough_1.ogg'
assert query_file_key in hear_embeddings and len(hear_embeddings[query_file_key]['embeddings'])

# Get the average embedding for the query file and compare similarity to the average embedding for the other files.
query_embedding = np.mean(hear_embeddings[query_file_key]['embeddings'], axis=0)
similarities = {}
for file_key, model_outputs in hear_embeddings.items():
  # Skip comparing file_key to itself or comparing to keys without HeAR embeddings.
  if file_key == query_file_key or not len(model_outputs['embeddings']):
    continue

  # Compute cosine similarity between the query file and the current file.
  current_embedding = np.mean(model_outputs['embeddings'], axis=0)
  similarities[file_key] = 1 - distance.cosine(query_embedding, current_embedding)

# Find the top N most similar entries
N = 3
top_N_similar = dict(sorted(similarities.items(), key=lambda item: item[1], reverse=True)[:N])
print(f"\nTop {N} most similar entries to '{query_file_key}':")
for key, similarity in top_N_similar.items():
    print(f"  {key}: {similarity:.3f}")
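
For reference, the similarity used in the cell above can also be written directly in NumPy. The sketch below mean-pools per-clip embeddings into one vector per file and compares files by cosine similarity. The random arrays are stand-ins for real HeAR embeddings, and `cosine_similarity` is an illustrative helper rather than part of this notebook.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


# Random stand-ins for per-clip embeddings with shape (num_clips, 512).
rng = np.random.default_rng(0)
file_a = rng.normal(size=(3, 512))
file_b = file_a + 0.01 * rng.normal(size=(3, 512))  # near-duplicate of file_a
file_c = rng.normal(size=(5, 512))                  # unrelated file

# Mean-pool the clips of each file, then compare against the query.
query = file_a.mean(axis=0)
sim_b = cosine_similarity(query, file_b.mean(axis=0))
sim_c = cosine_similarity(query, file_c.mean(axis=0))
print(f"near-duplicate: {sim_b:.3f}, unrelated: {sim_c:.3f}")
```

Averaging clip embeddings is a simple pooling choice. It ignores how many clips each file contributed and can blur files that contain several different sounds.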

# Extract Batch Frontend Spectrogram Features

We provide a standalone, frozen frontend spectrogram model for efficient generation of [PCEN Mel-Spectrogram](https://research.google/pubs/trainable-frontend-for-robust-and-far-field-keyword-spotting/) features, which are used by the event detectors. This model consists of non-trainable TensorFlow operations and prioritizes portability, scalability, and optimization at the cost of configurability.

The frontend can also be used as a standalone feature extractor for generating large amounts of training data for finetuning or retraining additional models as demonstrated below.

Each input clip must be 2 seconds long, and the corresponding output feature has shape `(200, 48)`. Typically, input clips are segmented with some overlap to avoid distorting sounds near clip boundaries.
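
To show how the overlap setting affects the number of clips, here is a NumPy sketch of sliding-window framing. It is intended to mirror `tf.signal.frame` with its default `pad_end=False`, where trailing samples that do not fill a whole window are dropped. The helper `frame_audio` and the silent 5-second input are illustrative assumptions.

```python
import numpy as np


def frame_audio(audio, frame_length: int, frame_step: int) -> np.ndarray:
    """Splits 1-D audio into overlapping windows; an incomplete tail is dropped."""
    audio = np.asarray(audio)
    if len(audio) < frame_length:
        # Pad short recordings so that at least one full clip is produced.
        audio = np.pad(audio, (0, frame_length - len(audio)))
    num_frames = 1 + (len(audio) - frame_length) // frame_step
    starts = frame_step * np.arange(num_frames)
    return np.stack([audio[s:s + frame_length] for s in starts])


SAMPLE_RATE = 16000
CLIP_DURATION = 2
frame_length = SAMPLE_RATE * CLIP_DURATION  # 32,000 samples

audio = np.zeros(5 * SAMPLE_RATE, dtype=np.float32)  # hypothetical 5 s recording
clip_shapes = {}
for overlap_percent in (10, 50):
    # Integer arithmetic avoids floating-point rounding in the step size.
    frame_step = frame_length * (100 - overlap_percent) // 100
    clips = frame_audio(audio, frame_length, frame_step)
    clip_shapes[overlap_percent] = clips.shape
    print(f"{overlap_percent}% overlap: step={frame_step}, clips shape={clips.shape}")
```

With 10% overlap the 5-second input yields 2 clips and the last 1.2 seconds are dropped, while 50% overlap yields 4 clips that cover the whole recording.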

In [None]:
# @title Plot Example Frontend Features
example_file_key =  'Sneezing.ogg'
audio, sample_rate = librosa.load(example_file_key, sr=SAMPLE_RATE, mono=True)
print(f"Loaded {example_file_key} with duration {len(audio)/sample_rate:0.2f}s")

# Extract example clips with 50% overlap, then generate frontend feature batch.
frame_length = int(CLIP_DURATION * SAMPLE_RATE)
audio_clip_batch = tf.signal.frame(audio, frame_length, frame_length // 2 )
frontend_feature_batch = frontend_model(audio_clip_batch)
print(f"Generated input feature batch from {example_file_key} with shape: {frontend_feature_batch.shape}")
for clip_i, frontend_feature in enumerate(frontend_feature_batch):
  plot_frontend_feature(frontend_feature.numpy(), title=f'Clip: {clip_i}', cmap=COLORMAP, figsize=(8, 3))

# Convert Event Detectors to TFLite

The included event detectors are based on [MobileNet-V3](https://huggingface.co/docs/timm/en/models/mobilenet-v3) and are designed to be run on-device with low latency and power requirements while also allowing for finetuning and quick training.

Converting the event detector models to [LiteRT](https://ai.google.dev/edge/litert) (formerly TFLite) allows for better performance on specific hardware such as mobile devices or compute-constrained servers.

To save compute in real-time or large-scale applications, these hardware-optimized event detectors can be used as a gate for HeAR, so that only clips detected to contain a health sound of interest are embedded (as shown above).
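
The gating idea can be sketched as a vectorized mask over a batch of detection scores. The score values below are made up, the zero-filled `clips` array stands in for real audio, and `select_clips_to_embed` is a hypothetical helper. Only the label order and the 0.9 threshold follow the earlier cells.

```python
import numpy as np

LABEL_LIST = ['Cough', 'Snore', 'Baby Cough', 'Breathe', 'Sneeze',
              'Throat Clear', 'Laugh', 'Speech']


def select_clips_to_embed(scores_batch, labels_to_embed, threshold) -> np.ndarray:
    """Returns a boolean mask of clips where ANY selected label exceeds threshold."""
    scores_batch = np.asarray(scores_batch)
    label_indices = [LABEL_LIST.index(label) for label in labels_to_embed]
    return np.any(scores_batch[:, label_indices] > threshold, axis=1)


# Made-up detector scores for three clips, shape (num_clips, num_labels).
scores = np.array([
    [0.95, 0.0, 0.1, 0.0, 0.00, 0.0, 0.00, 0.2],  # cough detected: keep
    [0.10, 0.0, 0.0, 0.0, 0.00, 0.0, 0.99, 0.9],  # laugh and speech only: skip
    [0.20, 0.0, 0.0, 0.0, 0.93, 0.0, 0.00, 0.0],  # sneeze detected: keep
])
clips = np.zeros((3, 32000), dtype=np.float32)  # placeholder audio clips

keep = select_clips_to_embed(scores, ['Cough', 'Snore', 'Breathe', 'Sneeze'], 0.9)
clips_for_hear = clips[keep]  # only these clips would be passed to HeAR
print(keep, clips_for_hear.shape)
```

In this example only two of the three clips would reach the larger embedding model, which is where the compute savings come from.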

The code below demonstrates converting the loaded event detector to a LiteRT (TFLite) model. Many more options are available to quantize or further optimize the converted models; see the [documentation](https://ai.google.dev/edge/litert/models/convert_tf) for more information.
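
To give a feel for what 8-bit weight quantization trades off, the sketch below quantizes a random float32 weight matrix to int8 with a single symmetric scale and measures the storage saving and rounding error. This is a simplified illustration under assumed settings, not LiteRT's actual quantization scheme, and the weight matrix is random rather than taken from the event detector.

```python
import numpy as np

# Random stand-in for one layer's float32 weights.
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=(256, 128)).astype(np.float32)

# Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = float(np.abs(weights).max()) / 127.0
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to see how much precision the 8-bit representation loses.
dequantized = quantized.astype(np.float32) * scale
max_error = float(np.abs(weights - dequantized).max())

print(f"storage: {weights.nbytes} -> {quantized.nbytes} bytes")
print(f"scale: {scale:.6f}, max rounding error: {max_error:.6f}")
```

Storage drops by a factor of four, and the rounding error per weight is bounded by about half the scale. Whether that is acceptable for a given detector has to be checked by evaluating the converted model.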

In [None]:
# @title TFLite Conversion
%%time

def convert_to_tflite(
    model: tf.keras.Model,
    quantize: bool = False,
) -> bytes:
  """Converts a Keras model to a TensorFlow Lite (TFLite) model.

  Args:
    model: The model to convert.
    quantize: If True, apply dynamic range quantization to optimize the model.

  Returns:
    The raw byte representation of the converted TFLite model.
  """
  converter = tf.lite.TFLiteConverter.from_keras_model(model)
  converter.target_spec.supported_ops = [
      tf.lite.OpsSet.TFLITE_BUILTINS,
      tf.lite.OpsSet.SELECT_TF_OPS,  # needed for frontend ops
  ]
  # See documentation for quantization options beyond dynamic range quantization.
  # https://ai.google.dev/edge/litert/models/post_training_quantization
  if quantize:
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
  return converter.convert()

# Convert event detector to TFLite and save result.
event_detector_lite = convert_to_tflite(event_detector, quantize=False)
tflite_output_path = 'event_detector.tflite'
with open(tflite_output_path, 'wb') as f:
  f.write(event_detector_lite)
print(f"Saved TFLite model to: {tflite_output_path}")

# Initialize TFLite model and print I/O specification.
tflite_interp = tf.lite.Interpreter(model_content=event_detector_lite)
print(f"Input details:\n {tflite_interp.get_input_details()[0]}")
print(f"Output details: \n {tflite_interp.get_output_details()[0]}")

# Next steps

Explore the other [notebooks](https://github.com/google-health/hear/blob/master/notebooks) to learn what else you can do with the model.