
Pretrained speech representations like wav2vec2 and HuBERT exhibit strong anisotropy, leading to high similarity between random embeddings. This work evaluates anisotropy in keyword spotting. Using Dynamic Time Warping, we show that despite anisotropy, wav2vec2 similarity measures effectively identify words in audio.


rfclara/dtw


Keyword Spotting on Audio Recordings

This repository provides a pipeline for running keyword spotting experiments on audio recordings, covering feature extraction, DTW-based retrieval, evaluation, and plotting. The pipeline has six steps:


  1. Forced Alignment: Aligns audio and XML transcriptions to produce word-level timestamps.
  2. Audio Chunk Extraction: Splits audio into sentence or word-level chunks using XML alignment.
  3. Label Generation: Creates a labels file indicating which sentences contain each query word.
  4. Feature Extraction: Extracts representations (XLSR-53, MFCC) for each chunk.
  5. DTW Scoring: Computes DTW similarity between query and sentence embeddings.
  6. Evaluation: Computes precision and recall at k for keyword retrieval.

Usage

1. Forced or Manual Alignment

Create timestamps for each word in the transcription.

This step uses nlp_pangloss.word_forced_alignment from https://github.com/rfclara/nlp_pangloss, which requires a trained ASR model.

2. Audio Chunk Extraction

Extract sentence or word-level audio chunks and annotations:

pixi run python src/parse_transcriptions_xml.py \
  data/235213.xml --output_dir data/processed/235213/sentences

TODO: Add word-level chunk extraction for queries from the aligned XML.

3. Label Generation

Create a labels file from your annotations and list of query words:

pixi run python dtw/labels_from_annotation.py \
  --annotations data/processed/235213/235213_annotations.tsv \
  --queries æ˧hi˩hi˩ æ˧ʂæ#˥ ɑ˩ʁo˧ ɕi˧-ɕi˩lo˩ ɕi˧tɕɤ˧ dzi˩ hĩ˧-ki˧-ki˩ hɯ˧ kʰv̩˧ʂæ˧˥ qo˩qɑ˩ \
  --output data/processed/235213/labels_235213.tsv
  • For multiple queries, provide a space-separated list of words.
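
The labelling logic can be sketched as follows. This is a minimal illustration, not the code in dtw/labels_from_annotation.py: the function name `build_labels`, the in-memory dictionary input, and the whitespace-token matching rule are all assumptions made for the example.

```python
def build_labels(annotations, queries):
    """Mark, for each query word, which sentences contain it.

    annotations: dict mapping a sentence id to its transcription text
                 (hypothetical structure; the real TSV layout may differ).
    queries:     list of query words.
    Returns a dict: query -> {sentence id -> 1 if present, else 0}.
    """
    labels = {}
    for query in queries:
        labels[query] = {
            # Assumption: a query matches only a whole whitespace-separated token.
            sent_id: int(query in text.split())
            for sent_id, text in annotations.items()
        }
    return labels


if __name__ == "__main__":
    toy_annotations = {"s1": "hɯ˧ dzi˩", "s2": "qo˩qɑ˩"}
    print(build_labels(toy_annotations, ["hɯ˧", "dzi˩"]))
```

Matching whole tokens avoids counting a short query as present whenever it is a substring of a longer word. Whether the repository does the same is not stated here.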

4. Feature Extraction

Extract representations for queries and sentences:

pixi run python dtw/extract_representations.py \
  --corpus_dir data/processed/235213/queries \
  --output data/processed/235213/queries_xlsr_embeddings \
  --model xlsr53

pixi run python dtw/extract_representations.py \
  --corpus_dir data/processed/235213/sentences \
  --output data/processed/235213/sentences_xlsr_embeddings \
  --model xlsr53

You can also use --model mfcc for MFCC features.

5. DTW Scoring

Compute DTW scores between queries and sentences:

pixi run python dtw/dtw.py \
  --queries_dir data/processed/235213/queries_xlsr_embeddings \
  --sentences_dir data/processed/235213/sentences_xlsr_embeddings \
  --scores_dir data/processed/235213/dtw_scores_xlsr \
  --normalize
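
To make the scoring step concrete, here is a minimal NumPy sketch of subsequence DTW with a cosine frame distance, where the query may start and end anywhere inside the sentence. It rests on several assumptions and may differ from dtw/dtw.py: the function names are hypothetical, embeddings are taken to be arrays of shape (frames, dimensions), and `normalize` is assumed to mean division by the query length.

```python
import numpy as np


def cosine_distance_matrix(query, sentence):
    """Pairwise cosine distance between query frames and sentence frames."""
    q_norm = np.maximum(np.linalg.norm(query, axis=1, keepdims=True), 1e-12)
    s_norm = np.maximum(np.linalg.norm(sentence, axis=1, keepdims=True), 1e-12)
    return 1.0 - (query / q_norm) @ (sentence / s_norm).T


def subsequence_dtw_cost(query, sentence, normalize=True):
    """Cost of the best alignment of `query` to any span of `sentence`.

    Lower values mean a better match.
    """
    cost = cosine_distance_matrix(query, sentence)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, :] = 0.0  # free start: the match may begin at any sentence frame
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in the query only
                acc[i, j - 1],      # advance in the sentence only
                acc[i - 1, j - 1],  # advance in both
            )
    best = acc[n, 1:].min()  # free end: the match may stop at any sentence frame
    # Assumption: normalization divides by the number of query frames.
    return best / n if normalize else best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sentence = rng.normal(size=(20, 8))
    print(subsequence_dtw_cost(sentence[5:10], sentence))
```

Dividing by the query length keeps costs comparable across queries of different durations, which matters when ranking sentences for several query words.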

6. Evaluation

Evaluate DTW retrieval results:

pixi run python dtw/evaluate.py \
  --labels data/processed/235213/labels_235213.tsv \
  --scores_folder data/processed/235213/dtw_scores_xlsr \
  --results_folder data/processed/235213/evaluation_results
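
For reference, precision and recall at k can be computed as in the sketch below. It is an illustration under stated assumptions rather than the logic of dtw/evaluate.py: the function name, the dictionary inputs, and the `lower_is_better` flag (set for distance-like DTW costs) are hypothetical.

```python
def precision_recall_at_k(scores, relevant, k, lower_is_better=True):
    """Precision and recall over the k best-ranked sentences for one query.

    scores:   dict mapping sentence id -> DTW score for the query.
    relevant: set of sentence ids that truly contain the query word.
    k:        number of top-ranked sentences to inspect (k >= 1).
    """
    # Rank sentences from best to worst according to the score direction.
    ranked = sorted(scores, key=scores.get, reverse=not lower_is_better)
    top_k = ranked[:k]
    hits = sum(1 for sent_id in top_k if sent_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    toy_scores = {"s1": 0.1, "s2": 0.5, "s3": 0.2, "s4": 0.9}
    print(precision_recall_at_k(toy_scores, {"s1", "s4"}, k=2))
```

In the toy example the two best-ranked sentences are s1 and s3, so one of the two retrieved sentences is relevant and one of the two relevant sentences is retrieved.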

Outputs

  • Audio chunks: data/processed/235213/sentences/
  • Embeddings: data/processed/235213/queries_xlsr_embeddings/, data/processed/235213/sentences_xlsr_embeddings/
  • DTW scores: data/processed/235213/dtw_scores_xlsr/
  • Labels: data/processed/235213/labels_235213.tsv
  • Evaluation results: data/processed/235213/evaluation_results/

Notes

  • All scripts can be run with pixi run python ... for reproducible environments.
  • Evaluation results include precision and recall at various k values for each query and feature type.
