AudioCaption: Listen and Tell

This repository provides source code for several audio captioning models, as well as labels for several datasets.

First, clone this repository:

git clone


For all datasets, labels are provided in the directory data/*.json.



The full AudioCaption hospital dataset (3710 video clips) can be downloaded via Google Drive.

There is also a Zenodo link: DOI

The audio-only part of the dataset can be downloaded via Google Drive.

An easy way to download the dataset is the gdown script, which can be installed with pip install gdown. Then:

cd data

If you need a proxy to download the dataset, we recommend using Proxychains.


The car scene dataset can be downloaded via Google Drive.

The source code for the proposed sentence-level loss is also provided here.

Related Papers

Here are papers related to this repository:

If you'd like to use the AudioCaption dataset, please cite:

  author    = {Mengyue Wu and
               Heinrich Dinkel and
               Kai Yu},
  title     = {Audio Caption: Listen and Tell},
  booktitle = {{IEEE} International Conference on Acoustics, Speech and Signal Processing,
               {ICASSP} 2019, Brighton, United Kingdom, May 12-17, 2019},
  pages     = {830--834},
  publisher = {{IEEE}},
  year      = {2019},
  url       = {},
  doi       = {10.1109/ICASSP.2019.8682377},
  timestamp = {Wed, 16 Oct 2019 14:14:52 +0200},


To successfully run the baseline, the following packages and frameworks are required:

  1. Kaldi (mostly for data processing)
  2. Several Python3 packages (most notably torch, see requirements.txt)

Prerequisite Installation

The code is written exclusively in Python3. To install all required packages, use the included requirements.txt: pip install -r requirements.txt.

This code only utilizes the feature pipeline of Kaldi, so only the feature binaries (featbin) need to be built:

git clone kaldi --origin upstream
cd kaldi && git pull
cd tools; make
cd ../src; make -j4 featbin

Lastly, set the environment variable KALDI_ROOT so that the scripts can function properly. Either export it in your current session or add it to ~/.bashrc or ~/.profile.
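As a rough illustration of why the variable matters, the sketch below checks that KALDI_ROOT is set and derives the directory where the `make -j4 featbin` step above places the feature binaries. The function name `featbin_dir` is hypothetical, and how the scripts in this repository actually consume KALDI_ROOT is not shown here, so treat this as an assumption rather than the repository's own logic.

```python
import os
from pathlib import Path


def featbin_dir(environ=os.environ):
    """Return the Kaldi featbin directory derived from KALDI_ROOT.

    Hypothetical helper: assumes binaries such as compute-fbank-feats
    live under $KALDI_ROOT/src/featbin after building featbin.
    """
    root = environ.get("KALDI_ROOT")
    if not root:
        raise RuntimeError(
            "KALDI_ROOT is not set; export it in this session "
            "or add it to ~/.bashrc or ~/.profile"
        )
    return Path(root) / "src" / "featbin"
```

Calling `featbin_dir()` early in a script gives a clear error message before any Kaldi command is attempted.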


(Optional) NLP Tokenizer

This repository already provides the tokenized dataset in JSON format. However, if you wish to tokenize differently (e.g., with a custom NLP tokenizer), we also provide a simple script to install and run the Stanford NLP Tokenizer.

This dataset is labelled in Chinese. Chinese differs from most Indo-European languages in several ways, including its script. In particular, Chinese does not mark word boundaries the way English does with a blank space. Instead, the reader is expected to split a sentence into semantically sound tokens.
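The difference can be seen in a minimal sketch that compares whitespace splitting on an English and a Chinese caption. Both example sentences and both function names are made up for illustration and are not taken from the dataset or the repository.

```python
def whitespace_tokens(sentence):
    """Split on blank space, which works for English-style text."""
    return sentence.split()


def char_tokens(sentence):
    """Fall back to one token per non-space character."""
    return [ch for ch in sentence if not ch.isspace()]


english = "a man is talking"  # illustrative caption
chinese = "一个男人在说话"      # illustrative caption with the same meaning

print(whitespace_tokens(english))  # four separate words
print(whitespace_tokens(chinese))  # the whole sentence as a single token
print(char_tokens(chinese))        # seven single-character tokens
```

Whitespace splitting leaves the Chinese sentence in one piece, and character splitting breaks multi-character words apart, so neither yields semantically sound tokens.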

However, the Stanford CoreNLP software supports tokenization of Chinese. The script downloads all the plugins that the CoreNLP tool needs to enable tokenization. The script utils/ requires a running server in the background in order to work.

To download and run the CoreNLP tokenization server, execute:

bash scripts/

This requires Java to be installed on your machine. It is recommended to run this script in the background.

(Optional) BERT Pretrained Embeddings

You can load pretrained word embeddings from Google BERT instead of training word embeddings from scratch. The scripts in utils/bert need a BERT server running in the background. We use the BERT server from bert-as-service.

To use bert-as-service, you first need to install it. It is recommended that you create a new environment with TensorFlow 1.x (bert-as-service requires version 1.10 or later) to run the BERT server, since it is incompatible with TensorFlow 2.x.

After bert-as-service is installed, download and run the BERT server by executing:

bash scripts/ <path-to-server> <num-workers> zh

By default, a server based on the BERT base Chinese model runs in the background. You can switch to another model by changing the corresponding model name and path in scripts/

To extract BERT word embeddings, run utils/bert/, where the usage is shown.

Extract Features

The Kaldi scp format requires one tab- or space-separated line per file with the information: FEATURENAME WAVEPATH
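For readers who prefer Python to shell one-liners, here is a minimal sketch of writing and parsing one such line. The helper names and the file name `0001.wav` are hypothetical. The naming scheme (a prefix, an underscore, then the file stem) mirrors the find/awk command used in this section, although the two can differ for file names containing more than one dot.

```python
from pathlib import Path


def scp_line(wav_path, prefix):
    """Build one 'FEATURENAME WAVEPATH' line for a wave file."""
    name = f"{prefix}_{Path(wav_path).stem}"
    return f"{name} {wav_path}"


def parse_scp_line(line):
    """Split an scp line into (FEATURENAME, WAVEPATH) at the first whitespace run."""
    name, path = line.strip().split(maxsplit=1)
    return name, path


line = scp_line("data/hospital/wav/0001.wav", "hospital")
print(line)
print(parse_scp_line(line))
```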

For example, to extract features from the hospital data, assume the raw data is placed in DATA_DIR (data/hospital/wav here) and the features will be stored in FEATURE_DIR (data/hospital here):

find $DATA_DIR -type f | awk -F[./] '{print "'$PREFIX'""_"$(NF-1),$0}' > $FEATURE_DIR/wav.scp
  • Filterbank:
compute-fbank-feats --config=config/kaldi/fbank.conf scp:$FEATURE_DIR/wav.scp ark:$FEATURE_DIR/fbank.ark
python utils/ $FEATURE_DIR/fbank.ark $FEATURE_DIR/fbank.hdf5
rm $FEATURE_DIR/fbank.ark
  • Logmelspectrogram:
python utils/ -prefix $PREFIX `cat $FEATURE_DIR/wav.scp | awk '{print $2}'` $FEATURE_DIR/logmel.hdf5 $FEATURE_DIR/logmel.scp mfcc -win_length 1764 -hop_length 882
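The -win_length 1764 and -hop_length 882 values in the log-mel command are given in samples. The sketch below converts them to milliseconds and counts frames for a clip. The 44100 Hz sample rate is an assumption (it is not stated in this section), and the frame count assumes no padding at the clip edges, which may not match what the extraction script does.

```python
def samples_to_ms(n_samples, sample_rate):
    """Convert a length in samples to milliseconds."""
    return 1000 * n_samples / sample_rate


def num_frames(n_samples, win_length, hop_length):
    """Count full analysis windows that fit in a clip, without padding."""
    if n_samples < win_length:
        return 0
    return 1 + (n_samples - win_length) // hop_length


SAMPLE_RATE = 44100  # assumed, not taken from the repository

print(samples_to_ms(1764, SAMPLE_RATE))     # window length in ms
print(samples_to_ms(882, SAMPLE_RATE))      # hop length in ms
print(num_frames(SAMPLE_RATE, 1764, 882))   # frames in one second of audio
```

Under that assumed sample rate, the window spans 40 ms and the hop 20 ms, so consecutive windows overlap by half.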

The Kaldi scp file can be further split into a development scp and an evaluation scp:

python utils/ $FEATURE_DIR/fbank.scp $FEATURE_DIR/zh_eval.json
python utils/ $FEATURE_DIR/logmel.scp $FEATURE_DIR/zh_eval.json
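Conceptually, such a split partitions the scp lines by whether their FEATURENAME belongs to the evaluation set. The sketch below illustrates the idea under assumptions: `split_scp` is a hypothetical helper, the evaluation keys are passed in as a plain set, and how the repository's script derives them from zh_eval.json is not shown here.

```python
def split_scp(lines, eval_keys):
    """Partition scp lines into (development, evaluation) lists.

    lines:     iterable of 'FEATURENAME WAVEPATH' strings
    eval_keys: set of FEATURENAME values that belong to the evaluation set
    """
    dev, evaluation = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        key = line.split(maxsplit=1)[0]
        (evaluation if key in eval_keys else dev).append(line)
    return dev, evaluation


# Made-up entries for illustration only.
lines = ["hospital_0001 a.wav", "hospital_0002 b.wav", "hospital_0003 c.wav"]
dev, evaluation = split_scp(lines, {"hospital_0002"})
print(dev)
print(evaluation)
```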

Training Configurator

Training configuration is done in config/*.yaml. Here one can adjust some hyperparameters e.g., number of hidden layers or embedding size. You can also write your own models in models/*.py and adjust the config to use that model (e.g. encoder: MYMODEL).

Note: All parameters used by the runners/*.py scripts have the same names as their .yaml counterparts. Each can be overridden on the command line by passing --ARG VALUE. For example, to switch the captions file to English captions, pass --caption_file data/hospital/en_dev.json.
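The override behavior can be pictured with a small sketch in which a dictionary stands in for the parsed .yaml file. This is not the repository's implementation: `parse_overrides` and `apply_overrides` are hypothetical helpers, and the `emb_dim` key and the `zh_dev.json` default are made up for illustration. Only `caption_file` and its English value come from the note above.

```python
import ast


def parse_overrides(argv):
    """Turn ['--caption_file', 'x.json'] into {'caption_file': 'x.json'}."""
    if len(argv) % 2 != 0:
        raise ValueError("expected pairs of --ARG VALUE")
    overrides = {}
    for flag, value in zip(argv[0::2], argv[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected an option starting with --, got {flag!r}")
        try:
            # Interpret numbers, booleans and lists; keep everything else as text.
            overrides[flag[2:]] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            overrides[flag[2:]] = value
    return overrides


def apply_overrides(config, argv):
    """Return a copy of config with command-line values taking precedence."""
    return {**config, **parse_overrides(argv)}


# Stand-in for a parsed config/*.yaml file (keys partly hypothetical).
defaults = {"caption_file": "data/hospital/zh_dev.json", "emb_dim": 256}
config = apply_overrides(
    defaults, ["--caption_file", "data/hospital/en_dev.json", "--emb_dim", "512"]
)
print(config)
```

Keeping the file values as defaults and letting the command line win makes one-off experiments possible without editing the .yaml file.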

Training models

In order to train a model (for example using standard cross entropy loss), simply run:

python runners/ train config/xe.yaml

This will store the training logs and model checkpoints in OUTPUTPATH/MODEL/TIMESTAMP.

Predicting and Evaluating

Predicting and evaluating is done by running evaluate:

export kaldi_stream="copy-feats scp:$FEATURE_DIR/fbank_eval.scp ark:- |"
export experiment_path=experiments/***
python runners/ evaluate $experiment_path "$kaldi_stream" $FEATURE_DIR/zh_eval.json

Standard machine translation and captioning metrics (BLEU@1-4, ROUGE-L, CIDEr, METEOR and SPICE) are included. METEOR and SPICE can only be used on English datasets.
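To give a feel for what BLEU@n measures, the sketch below computes the clipped (modified) n-gram precision that BLEU is built on. It is a simplified illustration, not the evaluation code shipped with this repository: the full metric also combines several n-gram orders and applies a brevity penalty. The function names and example sentences are chosen for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision(candidate, references, 1))  # 2 of 7 unigrams survive clipping
```

Clipping prevents a caption from scoring well by repeating one frequent word, which is why the degenerate candidate above gets a low score.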


Dataset and baseline for the first Audiocaption task