Text-to-Audio Grounding

This repository provides the data and source code for the Text-to-Audio Grounding (TAG) task.

Data

The AudioGrounding dataset is an augmented audio captioning dataset. It is based on AudioCaps, which was built from part of AudioSet, a large-scale audio event dataset. Therefore, the audio files can be downloaded from AudioSet.

The updated AudioGrounding v2 is available via DOI. Changes in version 2:

  1. Train/val/test sets are re-split and refined.
  2. Data are re-formatted and audio files are renamed.

The current label format is a list of audio_item entries, each containing:

  • audiocap_id: id in the AudioCaps label file
  • audio_id: audio filename, in the form of Y[youtube_id].wav
  • tokens: caption tokenized by NLTK
  • phrases: a list of phrase_item
    • phrase: tokens of the current query phrase
    • start_index: index of the first token of phrase, starting from 0
    • end_index: index of the last token of phrase
    • segments: a list of [onset, offset] timestamp annotations
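For illustration, a label entry can be inspected as sketched below. The commented-out values are made up and the path assumes the directory layout used in the packing steps; whether phrase is stored as a token list or a joined string should be checked against the actual file.

import json

# Load an AudioGrounding label file and look at one audio_item.
with open("data/audiogrounding/train/label.json") as f:
    data = json.load(f)

# Each entry follows the schema described above, e.g. (illustrative values only):
# {
#     "audiocap_id": 12345,
#     "audio_id": "Y-abc123xyz.wav",
#     "tokens": ["a", "dog", "barks", "loudly"],
#     "phrases": [
#         {"phrase": "a dog barks", "start_index": 0, "end_index": 2,
#          "segments": [[0.5, 2.3]]}
#     ]
# }
item = data[0]
print(item["audio_id"], len(item["phrases"]))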

TAG Baseline

We provide a baseline approach for TAG in this repository. To run the baseline:

  1. Check out the code and install the required Python packages:
git clone https://github.com/wsntxxn/TextToAudioGrounding
cd TextToAudioGrounding
pip install -r requirements.txt
  2. Download audio clips and labels from Zenodo.
  3. Pack waveforms, assuming the audio files are in $AUDIO:
mkdir data/audiogrounding
for split in train val test; do
  python utils/data/prepare_wav_csv.py $AUDIO/$split data/audiogrounding/$split/wav.csv
  python utils/data/pack_waveform.py data/audiogrounding/$split/wav.csv \
      -o data/audiogrounding/$split/waveform.h5 \
      --sample_rate 32000
done
python utils/data/prepare_duration.py data/audiogrounding/test/wav.csv data/audiogrounding/test/duration.csv
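Optionally, you can sanity-check a packed file with h5py; the sketch below only lists the datasets it contains, so it makes no assumption about the exact layout written by pack_waveform.py:

import h5py

# List every dataset stored in the packed file, with shape and dtype.
with h5py.File("data/audiogrounding/train/waveform.h5", "r") as hf:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    hf.visititems(show)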
  4. Prepare the vocabulary file:
python utils/build_vocab.py data/audiogrounding/train/label.json data/audiogrounding/train/vocab.pkl
  5. Run training and evaluation:
python python_scripts/training/run_strong.py train_evaluate \
    --train_config $TRAIN_CFG \
    --eval_config $EVAL_CFG

Alternatively, run training and evaluation as separate steps:

python python_scripts/training/run_strong.py train \
    --config $TRAIN_CFG
python python_scripts/training/run_strong.py evaluate \
    --experiment_path $EXP_PATH \
    --eval_config $EVAL_CFG

$TRAIN_CFG and $EVAL_CFG are YAML-formatted configuration files. $EXP_PATH is the checkpoint directory set in $TRAIN_CFG. Example configuration files are provided here.

Weakly-Supervised Text-to-Audio Grounding (WSTAG)

Inference

We provide the best-performing WSTAG model, which can be downloaded here. Unzip it into $MODEL_DIR:

unzip audiocaps_cnn8rnn_w2vmean_dp_ls_clustering_selfsup.zip -d $MODEL_DIR

Remember to modify the training-data vocabulary path in $MODEL_DIR/config.yaml (data.train.collate_fn.tokenizer.args.vocabulary) so that it points to $MODEL_DIR/vocab.pkl. The inference script reads the vocabulary path from $MODEL_DIR/config.yaml, which ensures the vocabulary used during training is also loaded for inference.
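You can edit the file by hand, or patch it with a small script like the sketch below (assumes PyYAML is installed, the config contains only plain YAML without custom tags, and $MODEL_DIR is exported in the environment):

import os
import yaml  # PyYAML

model_dir = os.environ["MODEL_DIR"]
cfg_path = os.path.join(model_dir, "config.yaml")

with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# Point the tokenizer vocabulary at the copy shipped with the unzipped model.
cfg["data"]["train"]["collate_fn"]["tokenizer"]["args"]["vocabulary"] = \
    os.path.join(model_dir, "vocab.pkl")

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)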

Inference:

python python_scripts/inference/inference.py inference_multi_text_model \
    --experiment_path $MODEL_DIR \
    --audio $AUDIO \
    --phrase $PHRASE \
    --output ./prob.png

Training

For all settings, training is done in the same way as in the baseline:

python $TRAIN_SCRIPT train_evaluate \
    --train_config $TRAIN_CFG \
    --eval_config $EVAL_CFG

The training scripts and configurations vary across settings. We provide the training script and an example configuration file for each setting.

Data Format

WSTAG uses audio captioning data for training. The training data format is the same as in AudioGrounding, the only difference being that phrase_item contains no segments. You can convert the original captioning data into this format yourself; the phrase parsing rules are provided here. Waveform packing and vocabulary preparation are also the same as in the baseline.
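For intuition only, a converted training item could be assembled as in the sketch below; the function, its phrase_spans argument, and the preprocessing details (lowercasing, how phrase is stored) are hypothetical, so follow the repository's own parsing rules for real data:

import json
from nltk.tokenize import word_tokenize

def make_wstag_item(audiocap_id, youtube_id, caption, phrase_spans):
    # phrase_spans: list of (start_index, end_index) token spans produced by
    # your own phrase parsing step.
    tokens = word_tokenize(caption.lower())
    return {
        "audiocap_id": audiocap_id,
        "audio_id": f"Y{youtube_id}.wav",
        "tokens": tokens,
        "phrases": [
            {
                "phrase": " ".join(tokens[start:end + 1]),
                "start_index": start,
                "end_index": end,
                # no "segments": WSTAG training data has no timestamp labels
            }
            for start, end in phrase_spans
        ],
    }

item = make_wstag_item(1234, "-abc123xyz", "A dog barks loudly nearby", [(0, 2)])
print(json.dumps(item, indent=2))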

Sentence-level WSTAG

Phrase-level WSTAG

For all phrase-level settings, the example $EVAL_CFG is eval.yaml. For all phrase-level settings except "X + self-supervision", $TRAIN_SCRIPT is run_weak_phrase.py.

random sampling
similarity-based sampling

Similarity-based sampling requires pre-computed phrase embeddings. We use the contrastive audio-text model trained on AudioCaps to extract phrase embeddings. Download the model from here, unzip it into $CLAP_DIR, then extract embeddings:

unzip audiocaps_cnn14_bertm.zip -d $CLAP_DIR
python utils/data/create_text_embedding/prepare_phrase_clap.py phrase \
    --experiment_path $CLAP_DIR \
    --phrase_input $DATA \
    --output $OUTPUT \
    --with_proj True

Then set the data.train.dataset.args.phrase_embed item in the training configuration file to $OUTPUT.

clustering-based sampling

Clustering-based sampling requires a clustering model, which we train on the pre-computed phrase embeddings.

python python_scripts/clustering/kmeans_emb.py \
    --embedding $PHRASE_EMB \
    --n_cluster $N_CLUSTER \
    --output $OUTPUT

$PHRASE_EMB is the phrase embedding file, i.e., the $OUTPUT of the previous step. Remember to set data.train.dataset.args.cluster_map to the corresponding cluster mapping file (which is not exactly the $OUTPUT of this step).
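Conceptually, the clustering step does something like the sketch below (scikit-learn k-means over a phrase-to-embedding dict; the input and output formats here are assumptions, and the actual kmeans_emb.py script may store things differently):

import pickle
import numpy as np
from sklearn.cluster import KMeans

# Assumed input: a pickle mapping phrase (str) -> embedding (1-D np.ndarray).
with open("phrase_emb.pkl", "rb") as f:
    phrase_to_emb = pickle.load(f)

phrases = list(phrase_to_emb)
embeddings = np.stack([phrase_to_emb[p] for p in phrases])

kmeans = KMeans(n_clusters=32, random_state=0).fit(embeddings)

# The cluster map assigns each phrase to a cluster index.
cluster_map = {p: int(c) for p, c in zip(phrases, kmeans.labels_)}

with open("cluster_map.pkl", "wb") as f:
    pickle.dump(cluster_map, f)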

X (any sampling) + self-supervision

The teacher.pretrained item in the training configuration should be set to the checkpoint path of the pretrained WSTAG model.
