This repository provides a pipeline for running keyword spotting experiments on audio recordings using feature extraction, DTW-based retrieval, evaluation and plotting.
- Forced Alignment: Aligns audio and XML transcriptions to produce word-level timestamps.
- Audio Chunk Extraction: Splits audio into sentence or word-level chunks using XML alignment.
- Label Generation: Creates a labels file indicating which sentences contain each query word.
- Feature Extraction: Extracts representations (XLSR-53, MFCC) for each chunk.
- DTW Scoring: Computes DTW similarity between query and sentence embeddings.
- Evaluation: Computes precision and recall at k for keyword retrieval.
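The DTW scoring step listed above can be illustrated with a small pure-Python sketch. This is not the code in dtw/dtw.py. The function name `dtw_cost`, the Euclidean frame distance, and the `subsequence` and `normalize` parameters are all assumptions made for illustration. The function returns a cost, so a lower value means a closer match.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_cost(query, sentence, subsequence=False, normalize=False):
    """Accumulated DTW cost between two sequences of frames.

    Hypothetical helper, not the repository's implementation.
    subsequence=True lets the query start and end anywhere in the sentence.
    normalize=True divides the cost by len(query) + len(sentence).
    """
    n, m = len(query), len(sentence)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    if subsequence:
        # Free starting point: the query may begin at any sentence frame.
        for j in range(m + 1):
            acc[0][j] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(query[i - 1], sentence[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Free end point for the subsequence variant, full alignment otherwise.
    cost = min(acc[n][1:]) if subsequence else acc[n][m]
    return cost / (n + m) if normalize else cost


# Toy one-dimensional "embeddings": the query [0, 1] occurs inside the sentence.
toy_query = [[0.0], [1.0]]
toy_sentence = [[5.0], [0.0], [1.0], [5.0]]
full_cost = dtw_cost(toy_query, toy_sentence)
sub_cost = dtw_cost(toy_query, toy_sentence, subsequence=True)
```

In the toy example the subsequence variant finds the query inside the longer sentence at zero cost, while the full alignment is penalised for the surrounding frames. Which variant suits a keyword spotting experiment depends on how the queries and sentences were cut.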
Create timestamps for each word in the transcription.
Forced alignment was done with nlp_pangloss.word_forced_alignment (https://github.com/rfclara/nlp_pangloss), which requires a trained ASR model.
Extract sentence or word-level audio chunks and annotations:
pixi run python src/parse_transcriptions_xml.py \
data/235213.xml --output_dir data/processed/235213/sentences
TODO: Add word-level chunk extraction for queries, from aligned XML.
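As a rough illustration of how sentence-level chunks can be derived from an aligned XML file, the sketch below reads time spans and converts them to sample ranges. It does not reproduce src/parse_transcriptions_xml.py. The XML layout (an S element with an id attribute, an AUDIO child carrying start and end in seconds, and a FORM child with the transcription) is an assumption, and `sentence_spans` and `to_sample_range` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of an aligned transcription file (assumed layout).
EXAMPLE_XML = (
    '<TEXT>'
    '<S id="S1"><AUDIO start="0.5" end="2.0"/><FORM>dzi˩ hɯ˧</FORM></S>'
    '<S id="S2"><FORM>no audio span</FORM></S>'
    '</TEXT>'
)


def sentence_spans(xml_text):
    """Return (id, start_seconds, end_seconds, text) for each S with an AUDIO span."""
    root = ET.fromstring(xml_text)
    spans = []
    for s in root.iter("S"):
        audio = s.find("AUDIO")
        if audio is None:
            # Sentences without timestamps cannot be cut from the recording.
            continue
        text = s.findtext("FORM", default="").strip()
        spans.append((s.get("id"), float(audio.get("start")), float(audio.get("end")), text))
    return spans


def to_sample_range(start_s, end_s, sample_rate):
    """Convert a span in seconds to (first_sample, end_sample) indices."""
    return int(round(start_s * sample_rate)), int(round(end_s * sample_rate))


spans = sentence_spans(EXAMPLE_XML)
first_range = to_sample_range(spans[0][1], spans[0][2], 16000)
```

Each sample range can then be used to slice the waveform of the full recording and write one audio file per sentence.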
Create a labels file from your annotations and list of query words:
pixi run python dtw/labels_from_annotation.py \
--annotations data/processed/235213/235213_annotations.tsv \
--queries æ˧hi˩hi˩ æ˧ʂæ#˥ ɑ˩ʁo˧ ɕi˧-ɕi˩lo˩ ɕi˧tɕɤ˧ dzi˩ hĩ˧-ki˧-ki˩ hɯ˧ kʰv̩˧ʂæ˧˥ qo˩qɑ˩ \
--output data/processed/235213/labels_235213.tsv
- For multiple queries, provide a space-separated list of words.
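The idea behind the labels file can be sketched as follows. This is not the logic of dtw/labels_from_annotation.py. The function name `build_labels`, the in-memory dictionary format, and the matching rule (a query counts as present only when it appears as a whole whitespace-separated token) are assumptions for illustration.

```python
def build_labels(annotations, queries):
    """Map each sentence id to {query: 1 or 0}.

    annotations: dict of sentence id -> transcription text.
    queries: list of query words.
    Assumed rule: a sentence contains a query if the query is one of its
    whitespace-separated tokens.
    """
    labels = {}
    for sent_id, text in annotations.items():
        tokens = set(text.split())
        labels[sent_id] = {q: int(q in tokens) for q in queries}
    return labels


# Toy annotations using query words from the command above.
toy_annotations = {"S1": "dzi˩ hɯ˧", "S2": "qo˩qɑ˩ hɯ˧"}
toy_labels = build_labels(toy_annotations, ["dzi˩", "qo˩qɑ˩"])
```

A substring match instead of a token match would also flag sentences where the query is part of a longer word, which may or may not be wanted for retrieval evaluation.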
Extract representations for queries and sentences:
pixi run python dtw/extract_representations.py \
--corpus_dir data/processed/235213/queries \
--output data/processed/235213/queries_xlsr_embeddings \
--model xlsr53
pixi run python dtw/extract_representations.py \
--corpus_dir data/processed/235213/sentences \
--output data/processed/235213/sentences_xlsr_embeddings \
--model xlsr53
You can also use --model mfcc for MFCC features.
Compute DTW scores between queries and sentences:
pixi run python dtw/dtw.py \
--queries_dir data/processed/235213/queries_xlsr_embeddings \
--sentences_dir data/processed/235213/sentences_xlsr_embeddings \
--scores_dir data/processed/235213/dtw_scores_xlsr \
--normalize
Evaluate DTW retrieval results:
pixi run python dtw/evaluate.py \
--labels data/processed/235213/labels_235213.tsv \
--scores_folder data/processed/235213/dtw_scores_xlsr \
--results_folder data/processed/235213/evaluation_results
With the commands above, outputs are written to:
- Audio chunks: data/processed/235213/sentences/
- Embeddings: data/processed/235213/queries_xlsr_embeddings/, data/processed/235213/sentences_xlsr_embeddings/
- DTW scores: data/processed/235213/dtw_scores_xlsr/
- Labels: data/processed/235213/labels_235213.tsv
- Evaluation results: data/processed/235213/evaluation_results/
- All scripts can be run with pixi run python ... for reproducible environments.
- Evaluation results include precision and recall at various k values for each query and feature type.
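For reference, precision and recall at k for a single query can be computed along these lines. This is a minimal sketch, not the code in dtw/evaluate.py. The function name `precision_recall_at_k` and the input format are hypothetical, and the sketch assumes that scores are DTW costs, so lower values rank first.

```python
def precision_recall_at_k(scores, relevant, k):
    """Precision and recall over the k best-ranked sentences for one query.

    scores: dict of sentence id -> DTW cost (assumed: lower is better).
    relevant: set of sentence ids that contain the query word.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(scores, key=scores.get)[:k]
    hits = sum(1 for sent_id in ranked if sent_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy scores for one query; S1 and S3 are the sentences that contain it.
toy_scores = {"S1": 0.2, "S2": 0.9, "S3": 0.4, "S4": 0.1}
toy_relevant = {"S1", "S3"}
p_at_2, r_at_2 = precision_recall_at_k(toy_scores, toy_relevant, 2)
p_at_3, r_at_3 = precision_recall_at_k(toy_scores, toy_relevant, 3)
```

If the scores were similarities rather than costs, the ranking would need to be reversed (for example with reverse=True in the sort).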