We present AudioTime, a strongly aligned audio-text dataset. Its text annotations are rich in temporal information (timestamps, duration, frequency, and ordering), covering almost all aspects of temporal control. We also provide a comprehensive test set and an evaluation metric, STEAM, to assess the temporal control performance of various models.
Audio samples can be found in the AudioTime-Demo. There are four types of alignment signals:
- Ordering: "A yip occurs, followed by a bleat after a short pause."
- Duration: "A water tap or faucet ran for 4.33 seconds."
- Frequency: "Sanding occurs once, followed by throat clearing twice."
- Timestamp: "An explosion occurs from 0.947 to 2.561 seconds, and then breaking sounds are heard from 4.368 to 5.790 seconds."
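To make the timestamp signal concrete, here is a minimal Python sketch that pulls onset/offset pairs out of a caption. It assumes the "from X to Y seconds" phrasing seen in the example above; GPT-generated captions may be worded differently, and the names `TIMESTAMP_PATTERN` and `extract_timestamps` are hypothetical helpers, not part of the AudioTime release. The annotation files already store the timestamps as metadata, so this is for illustration only.

```python
import re

# Matches only the "from X to Y seconds" phrasing (an assumption about caption wording).
TIMESTAMP_PATTERN = re.compile(r"from (\d+(?:\.\d+)?) to (\d+(?:\.\d+)?) seconds")


def extract_timestamps(caption):
    """Return a list of (onset, offset) pairs, in seconds, found in a caption."""
    return [(float(on), float(off)) for on, off in TIMESTAMP_PATTERN.findall(caption)]


caption = (
    "An explosion occurs from 0.947 to 2.561 seconds, and then breaking "
    "sounds are heard from 4.368 to 5.790 seconds."
)
print(extract_timestamps(caption))
```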
You can download the data from GoogleDrive: AudioTime(train) and AudioTime(test), or from BaiduNetDisk with the extraction code "time". The directory structure is:
AudioTime/
├── train/
│ ├── train5000_ordering/
│ │ ├── audio/
│ │ │ ├── syn_1.wav
│ │ │ ├── syn_2.wav
│ │ │ ├── ...
│ │ │ └── syn_5000.wav
│ │ └── ordering_captions.json
│ ├── ...
│ └── train5000_timestamp/
└── test/
The JSON files contain annotations, including audio_id, metadata, and GPT-generated captions. An example is shown below:
"syn_1": {
    "event": {
        "Electric shaver, electric razor": [
            [
                1.056,
                5.158
            ]
        ],
        "Jackhammer": [
            [
                7.66,
                10.0
            ]
        ]
    },
    "caption": "An electric shaver buzzes from 1.056 to 5.158 seconds, followed by a jackhammer pounding from 7.66 to 10 seconds."
},
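The annotation format above can be read with the standard json module. The sketch below assumes each caption file is a single JSON object keyed by audio_id, with an "event" mapping from label to a list of [onset, offset] pairs, as in the example entry; files for other tasks may carry different metadata. The function names are hypothetical.

```python
import json


def load_annotations(caption_file):
    """Load a caption file into a dict keyed by audio_id (assumed layout)."""
    with open(caption_file, "r", encoding="utf-8") as f:
        return json.load(f)


def event_durations(entry):
    """Return {event_label: total active time in seconds} for one entry."""
    return {
        label: round(sum(offset - onset for onset, offset in segments), 3)
        for label, segments in entry["event"].items()
    }


# Inline copy of the example entry, so the sketch runs without the dataset.
example = {
    "syn_1": {
        "event": {
            "Electric shaver, electric razor": [[1.056, 5.158]],
            "Jackhammer": [[7.66, 10.0]],
        },
        "caption": "An electric shaver buzzes from 1.056 to 5.158 seconds, "
                   "followed by a jackhammer pounding from 7.66 to 10 seconds.",
    }
}

for audio_id, entry in example.items():
    print(audio_id, event_durations(entry))
```

With the real data, `load_annotations` would be pointed at a file such as ordering_captions.json inside one of the train directories.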
The dataset statistics are shown in the figure below.
STEAM is a text-based metric that evaluates whether the generated audio segments meet the control requirements specified by the input text. STEAM assesses control performance based on detected timestamps and the control signal provided by the input free text. The testing script is available at STEAMtool.
- Ordering: Checks whether events A and B occur in the generated audio in the specified order; reported as an error rate.
- Duration / frequency: The absolute error between the event duration or frequency in the generated audio and the value specified in the text, averaged over the total number of events.
- Timestamp: Measures how accurately audio timestamps are controlled. Segment F1, a common metric in sound event detection tasks, is computed from the detected and specified onsets and offsets.
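The three quantities above can be sketched in a few lines of Python. This is an illustration of the definitions, not the STEAMtool implementation: the function names are hypothetical, and the 10-second clip length and 1-second segment resolution used for segment F1 are assumptions.

```python
def ordering_error_rate(results):
    """results: list of bools, True when the events appeared in the specified order."""
    return sum(1 for ok in results if not ok) / len(results)


def mean_absolute_error(predicted, specified):
    """Average |predicted - specified| over events (durations in s, or counts)."""
    return sum(abs(p - s) for p, s in zip(predicted, specified)) / len(specified)


def segment_f1(detected, specified, total=10.0, resolution=1.0):
    """Segment-based F1 for one event class over a clip of `total` seconds.

    detected / specified: lists of (onset, offset) pairs in seconds.
    """
    n_segments = int(round(total / resolution))

    def active(intervals):
        # A segment is active if any interval overlaps it.
        flags = []
        for i in range(n_segments):
            start, end = i * resolution, (i + 1) * resolution
            flags.append(any(on < end and off > start for on, off in intervals))
        return flags

    det, ref = active(detected), active(specified)
    tp = sum(d and r for d, r in zip(det, ref))
    fp = sum(d and not r for d, r in zip(det, ref))
    fn = sum(r and not d for d, r in zip(det, ref))
    denom = 2 * tp + fp + fn
    # If the event is neither detected nor specified, treat it as a perfect match.
    return 2 * tp / denom if denom else 1.0


print(ordering_error_rate([True, False, True, True]))
print(mean_absolute_error([4.0, 2.5], [4.33, 2.0]))
print(segment_f1([(1.0, 5.0)], [(1.056, 5.158)]))
```

In the last call the detected event ends just before the fifth second while the specified one runs slightly past it, so one of five reference segments is missed.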
git clone https://github.com/zeyuxie29/AudioTime.git
cd AudioTime/STEAMtool
pip install -e .
Download the audio-text grounding (ATG) model checkpoint and place it in /STEAMtool/steam/grounding_tool/grounding_ckpt/.
python steam/runner/steam_eval.py -p {generated_path} -t {task_name}
Here, {generated_path} is the path to the generated audio files; the audio file names must match the names in the caption files under /STEAMtool/data. {task_name} is the task type, one of "timestamp", "duration", "frequency", or "ordering" (the default).
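Because the file names must match the caption files, it can help to check them before running the evaluation. The sketch below is a hypothetical helper, not part of STEAMtool; it assumes the generated files are named `<audio_id>.wav` and that the caption file is a JSON object keyed by audio_id, which may not hold for every file under /STEAMtool/data.

```python
import json
from pathlib import Path


def find_name_mismatches(generated_path, caption_file):
    """Compare generated .wav file stems against the audio_ids in a caption file.

    Returns (missing, unexpected): ids with no audio file, and audio files
    with no matching id.
    """
    with open(caption_file, "r", encoding="utf-8") as f:
        expected = set(json.load(f))  # top-level keys are assumed to be audio_ids
    found = {p.stem for p in Path(generated_path).glob("*.wav")}
    return sorted(expected - found), sorted(found - expected)
```

An empty pair of lists means every caption has a matching audio file and vice versa.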