Keyword Spotting on Audio Recordings

This repository provides a pipeline for running keyword spotting experiments on audio recordings: feature extraction, retrieval based on dynamic time warping (DTW), evaluation, and plotting.

Forced Alignment: Aligns audio with XML transcriptions to produce word-level timestamps.
Audio Chunk Extraction: Splits audio into sentence- or word-level chunks using the XML alignment.
Label Generation: Creates a labels file indicating which sentences contain each query word.
Feature Extraction: Extracts representations (XLSR-53, MFCC) for each chunk.
DTW Scoring: Computes DTW similarity between query and sentence embeddings.
Evaluation: Computes precision and recall at k for keyword retrieval.

Usage

1. Forced or Manual Alignment

Create timestamps for each word in the transcription.

The alignments used here were produced with nlp_pangloss.word_forced_alignment, which requires a trained ASR model:

https://github.com/rfclara/nlp_pangloss

2. Audio Chunk Extraction

Extract sentence or word-level audio chunks and annotations:

pixi run python src/parse_transcriptions_xml.py \
  data/235213.xml --output_dir data/processed/235213/sentences
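
As a rough illustration of what sentence-level extraction involves, the sketch below reads sentence time spans from an aligned XML file and converts them to sample offsets. It is not the code in src/parse_transcriptions_xml.py: the element and attribute names (S, AUDIO with start and end, FORM) are assumptions about the XML layout, and sentence_spans and to_frames are hypothetical helpers.

```python
import xml.etree.ElementTree as ET


def sentence_spans(xml_text):
    """Return (sentence_id, start_sec, end_sec, transcription) tuples.

    Assumes each sentence is an <S id="..."> element holding an
    <AUDIO start="..." end="..."/> child and a <FORM> child (assumed layout).
    """
    root = ET.fromstring(xml_text)
    spans = []
    for sent in root.iter("S"):
        audio = sent.find("AUDIO")
        if audio is None:
            continue  # no timing information, so nothing to cut
        form = sent.find("FORM")
        text = (form.text or "").strip() if form is not None else ""
        spans.append(
            (sent.get("id"), float(audio.get("start")), float(audio.get("end")), text)
        )
    return spans


def to_frames(start_sec, end_sec, sample_rate):
    """Convert a span in seconds to (first_frame, number_of_frames)."""
    first = int(round(start_sec * sample_rate))
    last = int(round(end_sec * sample_rate))
    return first, max(0, last - first)


# Toy document, invented for this example.
EXAMPLE = """<TEXT id="demo">
  <S id="S1"><AUDIO start="0.50" end="2.25"/><FORM>first sentence</FORM></S>
  <S id="S2"><AUDIO start="2.25" end="4.00"/><FORM>second sentence</FORM></S>
</TEXT>"""

spans = sentence_spans(EXAMPLE)
for sent_id, start, end, text in spans:
    print(sent_id, to_frames(start, end, 16000), text)
```

The frame offsets can then be passed to any audio library to cut one file per sentence; the 16 kHz sample rate is only an example value.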

TODO: Add word-level chunk extraction for queries from the aligned XML.

3. Label Generation

Create a labels file from your annotations and list of query words:

pixi run python dtw/labels_from_annotation.py \
  --annotations data/processed/235213/235213_annotations.tsv \
  --queries æ˧hi˩hi˩ æ˧ʂæ#˥ ɑ˩ʁo˧ ɕi˧-ɕi˩lo˩ ɕi˧tɕɤ˧ dzi˩ hĩ˧-ki˧-ki˩ hɯ˧ kʰv̩˧ʂæ˧˥ qo˩qɑ˩ \
  --output data/processed/235213/labels_235213.tsv

For multiple queries, provide a space-separated list of words.
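
The idea behind the labels file can be sketched as follows: for every query word, mark each sentence 1 if its annotation contains that word and 0 otherwise. This is a minimal sketch, not the logic of dtw/labels_from_annotation.py. Whitespace tokenization, exact token matching, and the build_labels helper are assumptions, and the annotations are toy data.

```python
def build_labels(annotations, queries):
    """Map each query to {sentence_id: 1 or 0}.

    A sentence is labelled 1 when the query appears as a whole
    whitespace-separated token in its annotation (assumed matching rule).
    """
    labels = {}
    for query in queries:
        labels[query] = {
            sent_id: int(query in text.split())
            for sent_id, text in annotations.items()
        }
    return labels


# Toy annotations, invented for this example.
annotations = {
    "S1": "hɯ˧ dzi˩",
    "S2": "qo˩qɑ˩ hɯ˧",
    "S3": "ɑ˩ʁo˧",
}

labels = build_labels(annotations, ["hɯ˧", "dzi˩"])
for query, row in labels.items():
    print(query, row)
```

Exact token matching keeps a short query from matching inside a longer word; a real corpus may also need Unicode normalization before comparison.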

4. Feature Extraction

Extract representations for queries and sentences:

pixi run python dtw/extract_representations.py \
  --corpus_dir data/processed/235213/queries \
  --output data/processed/235213/queries_xlsr_embeddings \
  --model xlsr53

pixi run python dtw/extract_representations.py \
  --corpus_dir data/processed/235213/sentences \
  --output data/processed/235213/sentences_xlsr_embeddings \
  --model xlsr53

You can also use --model mfcc for MFCC features.

5. DTW Scoring

Compute DTW scores between queries and sentences:

pixi run python dtw/dtw.py \
  --queries_dir data/processed/235213/queries_xlsr_embeddings \
  --sentences_dir data/processed/235213/sentences_xlsr_embeddings \
  --scores_dir data/processed/235213/dtw_scores_xlsr \
  --normalize
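
For readers unfamiliar with DTW, the sketch below computes a DTW alignment cost between two sequences of frame vectors. It illustrates the general technique only: the cosine frame distance, the three-way step pattern, and the division by the summed sequence lengths are assumptions, and dtw/dtw.py (including what --normalize does) may differ.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def dtw_cost(query, sentence, normalize=True):
    """Classic DTW cost between two non-empty sequences of frame vectors.

    With normalize=True the accumulated cost is divided by
    len(query) + len(sentence) (an assumed normalization).
    """
    n, m = len(query), len(sentence)
    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning query[:i] with sentence[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(query[i - 1], sentence[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    total = acc[n][m]
    return total / (n + m) if normalize else total


# Toy two-dimensional "embeddings", invented for this example.
query = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # same pattern, slower
reversed_order = [[0.0, 1.0], [1.0, 0.0]]          # frames swapped

print(dtw_cost(query, stretched))
print(dtw_cost(query, reversed_order))
```

This variant aligns the whole query with the whole sentence. Keyword spotting setups often use a subsequence variant that lets the query match anywhere inside the sentence; which one this repository uses is not stated here.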

6. Evaluation

Evaluate DTW retrieval results:

pixi run python dtw/evaluate.py \
  --labels data/processed/235213/labels_235213.tsv \
  --scores_folder data/processed/235213/dtw_scores_xlsr \
  --results_folder data/processed/235213/evaluation_results
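
Precision and recall at k can be sketched as below: rank the sentences by score for one query, keep the top k, and count how many of them are labelled as containing the query. The precision_recall_at_k helper is hypothetical, and treating lower scores as better (DTW cost) is an assumption; pass lower_is_better=False if the scores are similarities.

```python
def precision_recall_at_k(scores, relevant, k, lower_is_better=True):
    """Return (precision@k, recall@k) for a single query.

    scores:   {sentence_id: score} for this query
    relevant: set of sentence ids that actually contain the query
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    ranked = sorted(scores, key=scores.get, reverse=not lower_is_better)
    hits = sum(1 for sent_id in ranked[:k] if sent_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy scores and labels, invented for this example.
scores = {"S1": 0.10, "S2": 0.40, "S3": 0.15, "S4": 0.30}
relevant = {"S1", "S4"}

for k in (1, 2, 3):
    print(k, precision_recall_at_k(scores, relevant, k))
```

Precision at k divides by k even when fewer than k sentences are available, which is the usual convention; recall is reported as 0.0 for a query with no relevant sentence.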

Outputs

Audio chunks: data/processed/235213/sentences/
Embeddings: data/processed/235213/queries_xlsr_embeddings/, data/processed/235213/sentences_xlsr_embeddings/
DTW scores: data/processed/235213/dtw_scores_xlsr/
Labels: data/processed/235213/labels_235213.tsv
Evaluation results: data/processed/235213/evaluation_results/

Notes

Run all scripts with pixi run python ... so that they use the project's reproducible pixi environment.
Evaluation results include precision and recall at various k values for each query and feature type.
