### Description

This dataset consists of high-quality voice recordings collected from participants in a controlled clinical research setting. Each participant provided a ~30-second sustained counting sample (in Hebrew, with a subset in Japanese), recorded using a cardioid dynamic USB/XLR microphone and Audacity software. Recordings are stored in lossless `.flac` format to preserve full audio fidelity, accompanied by participant metadata (Participant ID, sex, date of birth) in a structured `.csv` file.

### Introduction

Voice data is increasingly being used in clinical research and personalized medicine. Key trends include using vocal biomarkers for disease diagnosis (e.g., Parkinson's, Alzheimer's), mental health assessment, and monitoring respiratory conditions. Additionally, voice-based technologies are being developed for treatment adherence, remote monitoring, and personalized voice assistants in healthcare. Voice data is also explored for emotion recognition and speech therapy applications. Overall, voice data has the potential to transform healthcare by providing non-invasive, cost-effective, and convenient diagnostic and treatment methods.

A **vocal biomarker** is a feature, or a combination of features, of the voice audio signal that is associated with a clinical outcome. It can be used to monitor patients, diagnose a condition, grade the severity or stage of a disease, or support drug development.

Work on vocal biomarkers has mainly been performed in the field of neurodegenerative disorders, particularly Parkinson's disease, where voice disorders are very frequent (reported in as many as 70-80% of cases) and where voice changes are expected to serve as an early diagnostic biomarker. Other areas of research include mental health and monitoring emotions, multiple sclerosis and rheumatoid arthritis, Alzheimer's disease and mild cognitive impairment, cardiometabolic and cardiovascular diseases, and COVID-19 and other conditions with lung and respiratory symptoms.

### Measurement protocol 
<!-- long measurement protocol for the data browser -->

The participant is audio-recorded for about thirty (30) seconds while counting to thirty (30) in their native language (primarily Hebrew, with a subset in Japanese).

#### Device and Software

**Device type:** Cardioid dynamic USB/XLR microphone  
**Software:** Audacity (open-source audio recording/editing software)  
**Recording format:** Lossless `.flac` (primary), `.wav` supported  
**Recording duration:** ~30 seconds (counting 1–30)  
**Recording environment:** Quiet, controlled clinical setting

#### Device Specifications

| Parameter | Specification |
|-----------|---------------|
| Device type | Cardioid dynamic USB/XLR microphone |
| Polar pattern | Cardioid (unidirectional) |
| Connection type | USB-C / XLR |
| Sampling rate | 44.1 kHz (default), supports up to 48 kHz |
| Bit depth | 16-bit / 24-bit |
| Frequency response | 50 Hz – 15 kHz |
| Microphone sensitivity | -55 dBV/Pa (at 1 kHz) |
| Recording environment | Quiet, controlled clinical setting |
| Contraindications / limitations | Avoid high background noise; maintain fixed mic distance (15–30 cm) from mouth |
| Accessories used | Desktop mic stand, pop filter |
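
As a rough illustration of what the specifications above imply for a single recording, the sketch below computes the expected sample count and uncompressed size of a 30-second take. The function name and the mono (single-channel) default are assumptions made for illustration; the source does not state the channel count, and `.flac` files on disk will be smaller than the uncompressed figure.

```python
def expected_recording_size(duration_s=30.0, sample_rate_hz=44_100,
                            bit_depth=16, channels=1):
    """Return (samples per channel, uncompressed PCM bytes) for one recording.

    channels=1 is an assumption; the dataset description does not specify it.
    """
    n_samples = int(round(duration_s * sample_rate_hz))
    n_bytes = n_samples * channels * bit_depth // 8
    return n_samples, n_bytes


n_samples, n_bytes = expected_recording_size()
# Highest frequency representable at this sampling rate (Nyquist frequency).
nyquist_hz = 44_100 / 2

print(f"{n_samples:,} samples, {n_bytes / 1e6:.3f} MB uncompressed, "
      f"Nyquist {nyquist_hz:.0f} Hz")
```

At the default 44.1 kHz the Nyquist frequency (22.05 kHz) sits above the microphone's stated 15 kHz upper frequency response, so the sampling rate is not the limiting factor for bandwidth.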

### Data availability 
<!-- for the example notebooks -->

The information is stored as individual audio files in `.flac` format (lossless compression), with associated metadata in `.csv` files. Each recording is approximately 30 seconds in duration.

### Summary of available data 
<!-- for the data browser -->

**Voice recordings:** High-dimensional time-series data captured as audio recordings (`.flac` format), obtained by participants counting to thirty (30) in Hebrew (with a subset in Japanese). The `.flac` format retains full audio quality (lossless) and is preferred for clinical signal fidelity.

The dataset includes:
- Audio recordings in lossless `.flac` format
- Participant metadata (ID, sex, date of birth)
- Recording metadata (date, duration, language)

As of the current release, the dataset contains 9,246 Hebrew-speaking participants with an ongoing subset of Japanese-language recordings.


### Data availability

The data comprises two components:

1. **Voice recordings:** High-dimensional time-series data captured as audio recordings (`.flac` format), obtained by participants counting to thirty (30) in Hebrew. The `.flac` format retains full audio quality (lossless) and is preferred for clinical signal fidelity.

2. **Associated metadata:** A structured `.csv` file containing participant-level information, including Participant ID, Sex, and Date of Birth (DOB).

**Total participants:** 9,246 (including a few Japanese-speaking participants)


### Summary of available data

#### Raw Data
- **Voice recordings:** Lossless `.flac` audio files containing ~30 seconds of counting
- **Metadata:** CSV file with participant ID, sex, and date of birth

#### Info on our raw data

A `.flac` file stores audio with lossless compression, while a `.wav` file usually stores uncompressed PCM; in both cases no audio information is discarded. When a file is loaded using Librosa with the `sr` (sampling-rate) parameter set to `None`, the audio is decoded into a NumPy array at its native sampling rate, without resampling (by default Librosa also mixes multi-channel audio down to mono). This array represents the waveform, where each element is the amplitude of the audio signal at a specific point in time.
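
To make the "waveform as an array" idea concrete without needing a dataset file, the sketch below writes a synthetic tone to an in-memory 16-bit WAV and decodes it back to floating-point amplitudes at the native sampling rate. It uses only the standard-library `wave` module and NumPy; it is not Librosa's implementation, and the helper names are hypothetical. With a real recording, `librosa.load(path, sr=None)` returns a comparable `(waveform, sampling_rate)` pair.

```python
import io
import wave

import numpy as np


def write_wav_bytes(samples, sr):
    """Encode float samples in [-1, 1] as 16-bit mono PCM WAV bytes."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sr)
        wf.writeframes(pcm.tobytes())
    return buf.getvalue()


def read_wav_bytes(data):
    """Decode 16-bit mono WAV bytes to float32 amplitudes, keeping the native rate."""
    with wave.open(io.BytesIO(data), "rb") as wf:
        sr = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    y = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    return y, sr


sr_in = 44_100
t = np.arange(sr_in) / sr_in                 # one second of time stamps
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)   # synthetic stand-in for a voice signal

y, sr_out = read_wav_bytes(write_wav_bytes(tone, sr_in))
print(y.shape, y.dtype, sr_out)
```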

Digital audio data is processed with specialized libraries for feature extraction, signal processing, and visualization: Librosa, Praat, and OpenSmile for traditional clinical and signal-based features (current version, V1.0), and Wav2Vec for deep-learning-based features (next version, V2.0).

#### Useful Python Packages
- **Librosa:** Traditional signal processing and feature extraction tool
- **Praat:** Acoustic and phonetic analysis tool (usable from Python through the Parselmouth interface)
- **OpenSmile:** Audio feature extraction
- **BlaBla:** Clinical linguistic feature extraction tool
- **Wav2Vec2:** Deep learning network for feature extraction (V2.0)


### Features

#### 1. Traditional Signal Processing–Based Features (Clinically Relevant): V1.0

Acoustic features offer an objective way to quantify speech patterns that may be altered in clinical populations. Below are the key features extracted:

##### Fundamental Frequency (F0)
- Captures the basic pitch (lowest frequency) of vocal fold vibration
- Reflects both physiological voice-production characteristics and prosodic patterns
- Schizophrenia is consistently associated with reduced pitch variability and monotonous prosody, tied to blunted affect
- F0 is increasingly considered a candidate vocal biomarker for schizophrenia
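
One common way to estimate F0 for a short voiced frame is to pick the strongest autocorrelation peak within a plausible pitch range. The sketch below illustrates that idea on a synthetic harmonic tone; the function name and the 75-500 Hz search range are assumptions chosen for illustration, and it is not necessarily the estimator used by Praat, OpenSMILE, or this dataset's pipeline.

```python
import numpy as np


def estimate_f0_autocorr(frame, sr, fmin=75.0, fmax=500.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation peak."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Keep non-negative lags only: ac[k] is the correlation at lag k samples.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)                       # shortest period considered
    lag_max = min(int(sr / fmin), len(ac) - 1)     # longest period considered
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sr / best_lag


sr = 16_000
t = np.arange(int(0.1 * sr)) / sr
# Synthetic "voiced" frame: 200 Hz fundamental plus a weaker second harmonic.
frame = np.sin(2 * np.pi * 200.0 * t) + 0.5 * np.sin(2 * np.pi * 400.0 * t)

f0 = estimate_f0_autocorr(frame, sr)
print(f"Estimated F0: {f0:.1f} Hz")
```

On real speech this would be applied frame by frame to voiced segments only, and pitch variability could then be summarized as, for example, the standard deviation of the per-frame estimates.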

##### Formants (F1 – F3)
- Formants are resonant frequencies shaped by vocal tract and articulator positions, notably tongue and jaw
- Altered formant patterns suggest imprecise articulation, motor planning deficits, and cognitive planning issues in schizophrenia
- Clinical relevance lies in the ratios and distances between F1, F2, and F3, which reflect vocal tract configuration and motor function

##### Harmonics-to-Noise Ratio (HNR)
- Quantifies the ratio of periodic (harmonic) vs. aperiodic (noise) components in voice
- Lower HNR indicates breathiness, roughness, or dysphonia—commonly seen in schizophrenia
- Objective measures like HNR help reduce subjectivity in clinical voice evaluation

##### Jitter
- Reflects cycle-to-cycle variation in pitch (frequency perturbation)
- Elevated jitter indicates irregular laryngeal control and reduced neuromotor precision
- Normative range: <0.5% for adults

##### Shimmer
- Captures cycle-to-cycle variation in amplitude (amplitude perturbation)
- Higher shimmer signals reduced fine motor control in vocal folds
- Clinically associated with changes in glottal resistance and vocal roughness
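
Jitter and shimmer as described above share the same "local" form: the mean absolute difference between consecutive cycles, divided by the mean value, expressed as a percentage. The sketch below applies that formula to per-cycle periods (jitter) and per-cycle peak amplitudes (shimmer). The function name and the example values are hypothetical, and extracting the cycles themselves (for example with Praat) is assumed to have been done already.

```python
import numpy as np


def local_perturbation_percent(values):
    """Mean absolute cycle-to-cycle difference relative to the mean, in percent."""
    values = np.asarray(values, dtype=float)
    if values.size < 2:
        raise ValueError("need at least two cycles")
    return 100.0 * np.mean(np.abs(np.diff(values))) / np.mean(values)


# Hypothetical glottal cycle measurements, for illustration only.
periods_s = [0.0100, 0.0101, 0.0099, 0.0100]   # cycle durations in seconds
peak_amplitudes = [0.50, 0.52, 0.49, 0.51]     # per-cycle peak amplitudes

jitter_pct = local_perturbation_percent(periods_s)         # frequency perturbation
shimmer_pct = local_perturbation_percent(peak_amplitudes)  # amplitude perturbation
print(f"jitter = {jitter_pct:.2f}%, shimmer = {shimmer_pct:.2f}%")
```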

##### Amplitude (Intensity)
- Represents loudness dynamics—driven by respiratory support and communicative intent
- Reduced amplitude modulation often manifests in conditions like schizophrenia, congruent with negative symptoms like flat affect

##### Mel-Frequency Cepstral Coefficients (MFCCs)
- Compact spectral descriptors aligned with human auditory perception
- Encode both low-frequency spectral envelope and higher-frequency details
- In schizophrenia studies:
  - Support discrimination from healthy controls (accuracy ~82%)
  - Achieve very high diagnostic performance (accuracy 91.7%, ROC‑AUC ≈ 0.963) when combined with mel‑spectrograms
  - In multimodal approaches (speech + video), combining MFCCs with vocal tract features boosted detection by ~18%

##### Spectrogram
- Time-frequency representation of the audio signal
- Log-mel spectrograms, often used in deep learning pipelines, enhance differentiation of schizophrenia in speech
- Can serve as input "images" to vision-based models (e.g., convolutional neural networks)
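
A minimal sketch of a log-power spectrogram, built from a short-time Fourier transform with NumPy only, is shown below. The frame length, hop size, and function name are assumptions for illustration. A log-mel spectrogram additionally applies a mel filterbank to the power spectrum before taking the logarithm, which Librosa provides through `librosa.feature.melspectrogram` and `librosa.power_to_db`.

```python
import numpy as np


def log_power_spectrogram(y, n_fft=1024, hop=256):
    """Return a (frames x frequency bins) log-power spectrogram in dB."""
    y = np.asarray(y, dtype=float)
    n_frames = 1 + (len(y) - n_fft) // hop
    # Index matrix: row i selects samples [i*hop, i*hop + n_fft).
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hanning(n_fft)             # windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(power + 1e-10)           # small offset avoids log(0)


sr = 16_000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)                # synthetic 1-second test tone

spec_db = log_power_spectrogram(tone)
freqs = np.fft.rfftfreq(1024, d=1.0 / sr)
peak_hz = freqs[int(np.argmax(spec_db.mean(axis=0)))]
print(spec_db.shape, f"peak near {peak_hz:.1f} Hz")
```

The resulting 2-D array is what a vision-style model would consume as its input "image", typically after normalization.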

**Tools/Models:** OpenSMILE, Praat, Librosa  
**Clinical Use:** Jitter, shimmer, and HNR are the measures most often used clinically; mood and fatigue disorders often alter pitch dynamics, shimmer, and jitter


#### 2. Additional Non-Clinical Signal Features (Time and Frequency Domain): V1.0

- **Duration:** Total length of the audio signal (seconds)
- **Sample Rate:** Number of audio samples captured per second (Hz)
- **Number of Samples:** Total count of individual audio samples in the signal
- **Number of Channels:** Distinct audio channels (e.g., mono, stereo)
- **Bit Depth:** Bits used to represent each audio sample (affects resolution)
- **Number of Frames:** Total frames, each containing one sample per channel
- **Spectral Centroid:** "Center of mass" of the spectrum (brightness)
- **Spectral Rolloff:** Frequency below which a set percentage of spectral energy is contained
- **Spectral Bandwidth:** Width of the frequency band with most energy
- **Spectral Contrast:** Difference in energy between peaks and valleys of the spectrum
- **Spectral Flatness:** How noise-like (flat) or tone-like (peaky) the spectrum is
- **Zero Crossing Rate:** Rate at which the signal changes sign (related to noisiness)
- **RMS Energy:** Root mean square energy (perceived loudness)
- **Tempo:** Estimated beats per minute (BPM)
- **Chroma:** Distribution of energy across the 12 pitch classes of the octave
  - Chroma Mean: Average chroma values over time
  - Chroma Std: Standard deviation of chroma values (variability)
- **Tonnetz:** Tonal relationships in music based on harmonic intervals
  - Tonnetz Mean: Average tonal relationship values over time
  - Tonnetz Std: Standard deviation of tonal relationship values
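
Several of the descriptors listed above reduce to a few lines of array arithmetic. The sketch below computes RMS energy, zero crossing rate, and spectral centroid over a whole signal with NumPy; the function name is hypothetical. Librosa offers frame-wise versions (`librosa.feature.rms`, `librosa.feature.zero_crossing_rate`, `librosa.feature.spectral_centroid`), which may differ in detail from this whole-signal illustration.

```python
import numpy as np


def basic_descriptors(y, sr):
    """Whole-signal RMS energy, zero crossing rate, and spectral centroid."""
    y = np.asarray(y, dtype=float)
    rms = float(np.sqrt(np.mean(y ** 2)))

    # Fraction of adjacent sample pairs whose sign differs.
    negative = np.signbit(y)
    zcr = float(np.mean(negative[1:] != negative[:-1]))

    # Magnitude-weighted mean frequency ("center of mass" of the spectrum).
    magnitude = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    centroid = float(np.sum(freqs * magnitude) / np.sum(magnitude))

    return {"rms": rms, "zero_crossing_rate": zcr, "spectral_centroid_hz": centroid}


sr = 16_000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 1000.0 * t)   # synthetic 1 kHz test tone

features = basic_descriptors(tone, sr)
print(features)
```

For a pure tone the centroid sits at the tone frequency and the zero crossing rate is close to twice the frequency divided by the sampling rate, which makes a synthetic tone a convenient sanity check before running on recordings.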

#### 3. Foundational Model / Deep Learning–Based Features (Further Research Required): V2.0

##### Model-Based SSL Embeddings
- **Models:** wav2vec 2.0, WavLM, XLS-R, HuBERT
- **Clinical Use:** Encode broad vocal-tract information (generic)

##### Speaker & Diarisation Vectors
- **Models:** x-vector
- **Clinical Use:** Capture subtle vocal-tract/breathing habits; reported in research to perform best for detecting sleep apnoea in male speakers

##### Affect / Prosody Models
- **Models:** wav2vec2-SER, WavLM-SED
- **Clinical Use:** Mood and fatigue disorders often alter pitch dynamics and shimmer


### Data Quality and Sanitation

Quality control for voice data focuses on ensuring clean, analyzable recordings with minimal artifacts. All incoming audio undergoes automated checks to identify and flag recordings with:

- Silence or low-signal segments exceeding a defined threshold (e.g., >30% of duration)
- Excessive background noise based on spectral flatness, zero-crossing rate, and SNR measures (V2.0)
- Clipping or distortion caused by over-amplification
- Out-of-bounds acoustic values (e.g., pitch range outside human norms)

QC status is stored in metadata, and failed recordings are reviewed for possible re-recording or exclusion from downstream analysis.
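
The silence and clipping checks listed above can be sketched as follows. Apart from the 30% silence limit quoted in the list, every threshold here (the -40 dB frame level, the 0.99 clip level, the 0.1% clipped-sample limit) and the function name are assumptions for illustration, not the dataset's actual QC settings.

```python
import numpy as np


def qc_flags(y, sr, frame_s=0.025, silence_db=-40.0, max_silent_fraction=0.30,
             clip_level=0.99, max_clipped_fraction=0.001):
    """Flag a recording with too much silence or clipping (illustrative thresholds)."""
    y = np.asarray(y, dtype=float)
    frame_len = max(1, int(frame_s * sr))
    n_frames = len(y) // frame_len
    if n_frames == 0:
        raise ValueError("recording shorter than one frame")

    # Non-overlapping frames; a frame is "silent" if its RMS level is below silence_db.
    frames = y[: n_frames * frame_len].reshape(n_frames, frame_len)
    frame_db = 20.0 * np.log10(np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-10)
    silent_fraction = float(np.mean(frame_db < silence_db))

    # Samples at or near full scale suggest over-amplification.
    clipped_fraction = float(np.mean(np.abs(y) >= clip_level))

    return {
        "silent_fraction": silent_fraction,
        "clipped_fraction": clipped_fraction,
        "too_silent": silent_fraction > max_silent_fraction,
        "clipped": clipped_fraction > max_clipped_fraction,
    }


sr = 16_000
t = np.arange(sr // 2) / sr
voiced = 0.5 * np.sin(2 * np.pi * 200.0 * t)

half_silent = np.concatenate([np.zeros(sr // 2), voiced])   # first half is silence
overdriven = np.clip(2.4 * voiced, -1.0, 1.0)               # amplified past full scale

silent_report = qc_flags(half_silent, sr)
clipped_report = qc_flags(overdriven, sr)
print(silent_report)
print(clipped_report)
```

The returned dictionary could be written alongside the recording's metadata so that flagged files are easy to filter for review.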

**Note:** There are no established QC protocols for voice signal processing. We are developing signal-quality and status measurements based on existing features, including methods to detect silence, out-of-bounds values, and extreme noise in audio signals.


### Association with Clinical Factors

Current research in voice recordings related to health has shown associations between voice features and various health conditions:

- **Aging:** Voice features, such as fundamental frequency, formants, and jitter, have been found to change with age, reflecting physiological changes in the vocal tract and larynx.

- **Emotional state:** Voice features, such as pitch, intensity, and speech rate, can be used to detect and differentiate various emotional states (e.g., happiness, sadness, anger, fear).

- **Neurological disorders:** Voice features, such as pitch variability, irregular phonation, and speech rate, have been used to detect and monitor neurological disorders like Parkinson's disease, Alzheimer's disease, and multiple sclerosis.

- **Psychiatric disorders:** Voice features have been used to detect and monitor psychiatric disorders such as depression, anxiety, bipolar disorder, and schizophrenia. This may include changes in prosody, speech rate, or other acoustic features.

- **Vocal fatigue:** Voice features, such as jitter, shimmer, and harmonics-to-noise ratio, can be used to assess vocal fatigue and monitor the vocal health of professional voice users like singers, actors, and teachers.

- **Respiratory disorders:** Voice features, such as breathiness, pitch variability, and speech rate, have been used to detect and monitor respiratory disorders like asthma, chronic obstructive pulmonary disease (COPD), and laryngeal disorders.

- **Speech disorders:** Voice features, such as formants, pitch, and speech rate, have been used to detect and monitor speech disorders like stuttering, cluttering, and speech sound disorders.

- **Language identification:** Voice features, such as MFCCs, formants, and pitch, have been used to identify the language spoken by an individual or to detect accents and dialects.


### Relevant links

* [Pheno Knowledgebase](https://knowledgebase.pheno.ai/)
* [Pheno Data Browser](https://pheno-demo-app.vercel.app/)

