~~~
Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
~~~


# Quick start with Hugging Face (PyTorch model)

<table><tbody><tr>
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/google-health/hear/blob/master/notebooks/quick_start_with_hugging_face_pytorch.ipynb">
      <img alt="Google Colab logo" src="https://www.tensorflow.org/images/colab_logo_32px.png" width="32px"><br> Run in Google Colab
    </a>
  </td>  
  <td style="text-align: center">
    <a href="https://github.com/google-health/hear/blob/master/notebooks/quick_start_with_hugging_face_pytorch.ipynb">
      <img alt="GitHub logo" src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" width="32px"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://huggingface.co/google/hear-pytorch">
      <img alt="HuggingFace logo" src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" width="32px"><br> View on HuggingFace
    </a>
  </td>
</tr></tbody></table>

This Colab notebook shows basic usage of the HeAR encoder, which generates machine learning representations (known as "embeddings") from health-related sounds (2-second audio clips sampled at 16 kHz). These embeddings can be used to develop custom machine learning models for health acoustic use cases with less data and compute than traditional model development methods require.

Learn more about embeddings and their benefits on [this page](https://developers.google.com/health-ai-developer-foundations/hear).

## Install dependencies

In [None]:
! git clone https://github.com/Google-Health/hear.git
! pip install --upgrade --quiet transformers==4.50.3

## Authenticate with Hugging Face (skip if you have an HF_TOKEN secret)

In [None]:
from huggingface_hub import get_token, notebook_login

# Prompt for a login only when no token is already available
# (for example, from an HF_TOKEN secret or a previous login).
if get_token() is None:
    notebook_login()

## Load and play cough audio recording

In [None]:
SAMPLE_RATE = 16000  # Samples per second (Hz)
CLIP_DURATION = 2    # Duration of the audio clip in seconds
CLIP_LENGTH = SAMPLE_RATE * CLIP_DURATION  # Total number of samples

In [None]:
!wget -nc https://upload.wikimedia.org/wikipedia/commons/b/be/Woman_coughing_three_times.wav

In [None]:
from scipy.io import wavfile

# Load file
with open('Woman_coughing_three_times.wav', 'rb') as f:
  original_sampling_rate, audio_array = wavfile.read(f)

print(f"Sample Rate: {original_sampling_rate} Hz")
print(f"Data Shape: {audio_array.shape}")
print(f"Data Type: {audio_array.dtype}")


In [None]:
from IPython.display import Audio, display
import importlib

# Import the audio helpers from the repository cloned in the install step
audio_utils = importlib.import_module(
    "hear.python.data_processing.audio_utils"
)
resample_audio_and_convert_to_mono = audio_utils.resample_audio_and_convert_to_mono

# Resample to the rate the model expects and collapse to a single channel
audio_array = resample_audio_and_convert_to_mono(
    audio_array=audio_array,
    sampling_rate=original_sampling_rate,
    new_sampling_rate=SAMPLE_RATE,
)
display(Audio(audio_array, rate=SAMPLE_RATE))
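
The encoder works on 2-second clips, so a longer recording has to be cut into fixed-length pieces before inference. The sketch below shows one possible way to do that with NumPy. `split_into_clips` is a hypothetical helper, not part of the HeAR repository, and zero-padding the final clip is an assumption rather than a documented requirement.

```python
import numpy as np

SAMPLE_RATE = 16000  # Samples per second (Hz)
CLIP_DURATION = 2    # Duration of each clip in seconds
CLIP_LENGTH = SAMPLE_RATE * CLIP_DURATION


def split_into_clips(audio, clip_length=CLIP_LENGTH):
    """Hypothetical helper: cut 1-D mono audio into zero-padded clips."""
    audio = np.asarray(audio, dtype=np.float32)
    # Always return at least one clip, even for very short input.
    num_clips = max(1, int(np.ceil(len(audio) / clip_length)))
    padded = np.zeros(num_clips * clip_length, dtype=np.float32)
    padded[:len(audio)] = audio
    # Each row is one clip, so the result already has a batch dimension.
    return padded.reshape(num_clips, clip_length)


# A synthetic 5-second signal stands in for a real recording.
demo_audio = np.random.default_rng(0).standard_normal(5 * SAMPLE_RATE)
clips = split_into_clips(demo_audio)
print(clips.shape)  # (3, 32000)
```

The result has shape `(num_clips, CLIP_LENGTH)`, the same batch-first layout that the inference cell in this notebook builds with `np.expand_dims`.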

## Compute embeddings

In [None]:
from transformers import ViTConfig, ViTModel


# Describe the ViT architecture, then load the pretrained weights from the
# Hugging Face Hub
configuration = ViTConfig(
    image_size=(192, 128),
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=1024 * 4,
    hidden_act="gelu_fast",
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
    initializer_range=0.02,
    layer_norm_eps=1e-6,
    pooled_dim=512,
    patch_size=16,
    num_channels=1,
    qkv_bias=True,
    encoder_stride=16,
    pooler_act='linear',
    pooler_output_size=512,
)
loaded_model = ViTModel.from_pretrained(
    "google/hear-pytorch",
    config=configuration
)
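
To get a feel for the configuration above, the short calculation below works out how many patches a `(192, 128)` input is split into with `patch_size=16`. It is plain arithmetic rather than a call into `transformers`, and it assumes the standard ViT layout in which one classification token is prepended to the patch sequence.

```python
IMAGE_SIZE = (192, 128)  # Matches image_size in the configuration
PATCH_SIZE = 16          # Matches patch_size in the configuration

# Number of non-overlapping patches along each input axis.
patches_per_axis = tuple(side // PATCH_SIZE for side in IMAGE_SIZE)
num_patches = patches_per_axis[0] * patches_per_axis[1]

# Assumption: a single classification token precedes the patch tokens.
sequence_length = num_patches + 1

print(patches_per_axis, num_patches, sequence_length)  # (12, 8) 96 97
```

Under that assumption, each clip is represented by 97 tokens of width `hidden_size=1024` inside the encoder, before the pooler reduces the output to `pooler_output_size=512`.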

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch

preprocess_audio = audio_utils.preprocess_audio

# This index corresponds to a cough and was determined by hand. In practice, you
# would need a detector.
START = 0

# Add batch dimension
input_tensor = np.expand_dims(audio_array[START: START + CLIP_LENGTH], axis=0)

# Run inference; gradients are not needed here
def infer(audio_batch):
    with torch.no_grad():
        return loaded_model(
            preprocess_audio(audio_batch), return_dict=True, output_hidden_states=True)

output = infer(torch.Tensor(input_tensor))

# Extract the embedding vector
embedding_vector = np.asarray(output.pooler_output.detach()).flatten()
print("Size of embedding vector:", len(embedding_vector))

# Plot the embedding vector
plt.figure(figsize=(12, 4))
plt.plot(embedding_vector)
plt.title('Embedding Vector')
plt.xlabel('Index')
plt.ylabel('Value')
plt.grid(True)
plt.show()
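
The cell above sets `START` by hand and notes that a detector would be needed in practice. As a rough stand-in, the sketch below picks the 2-second window with the highest signal energy. `loudest_window_start` is a hypothetical helper, and energy is only a crude proxy: it finds loud segments, not coughs specifically.

```python
import numpy as np

SAMPLE_RATE = 16000
CLIP_LENGTH = 2 * SAMPLE_RATE


def loudest_window_start(audio, window=CLIP_LENGTH, hop=SAMPLE_RATE // 10):
    """Hypothetical helper: start index of the highest-energy window."""
    audio = np.asarray(audio, dtype=np.float64)
    if len(audio) <= window:
        return 0
    # Prefix sums of squared samples give each window's energy in O(1).
    cumulative = np.concatenate(([0.0], np.cumsum(audio ** 2)))
    starts = np.arange(0, len(audio) - window + 1, hop)
    energies = cumulative[starts + window] - cumulative[starts]
    return int(starts[np.argmax(energies)])


# Synthetic example: 6 seconds of silence with a burst from 3.0 s to 3.5 s.
demo = np.zeros(6 * SAMPLE_RATE)
burst_start = 3 * SAMPLE_RATE
burst_end = burst_start + SAMPLE_RATE // 2
demo[burst_start:burst_end] = 1.0

start = loudest_window_start(demo)
print("Selected window:", start, "to", start + CLIP_LENGTH)
```

The returned index could replace the hand-picked `START`, although a trained event detector would be a better choice for real recordings.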

## Next steps

Explore the other [notebooks](https://github.com/google-health/hear/blob/master/notebooks) to learn what else you can do with the model.
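
As an illustration of building a custom model on top of embeddings, the sketch below fits a nearest-centroid classifier using cosine similarity. Everything in it is an assumption for demonstration purposes: the 512-dimensional vectors are synthetic random data rather than HeAR outputs, the two classes are made up, and `fit_centroids` and `predict` are hypothetical helpers.

```python
import numpy as np

EMBEDDING_DIM = 512  # Matches the pooled output size used in this notebook
rng = np.random.default_rng(0)

# Synthetic stand-ins for embeddings: two classes scattered around two means.
class_means = rng.standard_normal((2, EMBEDDING_DIM))
labels = np.repeat([0, 1], 20)
embeddings = class_means[labels] + 0.5 * rng.standard_normal((40, EMBEDDING_DIM))


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def fit_centroids(train_embeddings, train_labels):
    """Hypothetical helper: mean normalized embedding per class, in label order."""
    normalized = l2_normalize(train_embeddings)
    classes = np.unique(train_labels)
    return np.stack([normalized[train_labels == c].mean(axis=0) for c in classes])


def predict(test_embeddings, centroids):
    """Return the index of the most similar centroid for each embedding."""
    similarities = l2_normalize(test_embeddings) @ l2_normalize(centroids).T
    return similarities.argmax(axis=1)


# Even rows train, odd rows test. Labels are 0 and 1, so the centroid index
# returned by predict() is also the class label.
centroids = fit_centroids(embeddings[::2], labels[::2])
predictions = predict(embeddings[1::2], centroids)
accuracy = (predictions == labels[1::2]).mean()
print("Held-out accuracy on synthetic data:", accuracy)
```

With real data, the synthetic array would be replaced by embeddings computed from labeled clips, and a stronger classifier could be substituted for the centroid rule.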