AudioTime

We present AudioTime, a strongly aligned audio-text dataset. Its text annotations are rich in temporal information (timestamps, duration, frequency, and ordering), covering almost all aspects of temporal control. We also provide a comprehensive test set and an evaluation metric, STEAM, for assessing the temporal control performance of different models.

Dataset

Audio samples can be found in the AudioTime-Demo. There are four types of alignment signals:

Ordering: "A yip occurs, followed by a bleat after a short pause."
Duration: "A water tap or faucet ran for 4.33 seconds."
Frequency: "Sanding occurs once, followed by throat clearing twice."
Timestamp: "An explosion occurs from 0.947 to 2.561 seconds, and then breaking sounds are heard from 4.368 to 5.790 seconds."

You can download the data from GoogleDrive: AudioTime(train) and AudioTime(test), or from BaiduNetDisk with the extraction code "time". The directory structure is:

AudioTime/
├── train/
│   ├── train5000_ordering/
│   │   ├── audio/
│   │   │   ├── syn_1.wav
│   │   │   ├── syn_2.wav
│   │   │   ├── ...
│   │   │   └── syn_5000.wav
│   │   └── ordering_captions.json
│   ├── ...   
│   └── train5000_timestamp/
└── test/

The JSON files contain annotations, including audio_id, metadata, and GPT-generated captions. An example is shown below:

"syn_1": {
    "event": {
        "Electric shaver, electric razor": [
            [1.056, 5.158]
        ],
        "Jackhammer": [
            [7.66, 10.0]
        ]
    },
    "caption": "An electric shaver buzzes from 1.056 to 5.158 seconds, followed by a jackhammer pounding from 7.66 to 10 seconds."
},
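To make the annotation layout concrete, here is a minimal Python sketch for reading a caption file and listing its events in time order. It assumes each caption file is a single JSON object keyed by audio_id, as the excerpt above suggests; `load_annotations` and `flatten_events` are hypothetical helper names, not part of the released tooling.

```python
import json


def load_annotations(json_path):
    """Load a caption file, assumed to be one JSON object keyed by audio_id."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def flatten_events(annotation):
    """Turn the "event" mapping into (label, onset, offset) tuples sorted by onset."""
    segments = [
        (label, float(onset), float(offset))
        for label, spans in annotation["event"].items()
        for onset, offset in spans
    ]
    return sorted(segments, key=lambda seg: seg[1])


# Inline copy of the example entry, wrapped in braces so it parses as JSON.
example = json.loads("""
{
  "syn_1": {
    "event": {
      "Electric shaver, electric razor": [[1.056, 5.158]],
      "Jackhammer": [[7.66, 10.0]]
    },
    "caption": "An electric shaver buzzes from 1.056 to 5.158 seconds, followed by a jackhammer pounding from 7.66 to 10 seconds."
  }
}
""")

for audio_id, annotation in example.items():
    print(audio_id, annotation["caption"])
    for label, onset, offset in flatten_events(annotation):
        print(f"  {label}: {onset:.3f}s to {offset:.3f}s")
```

For a downloaded file, `load_annotations("AudioTime/train/train5000_ordering/ordering_captions.json")` should return the same kind of mapping, provided the file follows the layout shown above.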

The dataset statistics are shown in the figure below.

STEAM: Strongly TEmporally-Aligned evaluation Metric

STEAM is a text-based metric that evaluates whether the generated audio segments meet the control requirements specified by the input text. STEAM assesses control performance based on detected timestamps and the control signal provided by the input free text. The testing script is available at STEAMtool.

Ordering: Determines whether events A and B appear in the generated audio in the specified order, reported as an error rate.
Duration / frequency: The absolute error between the event duration or frequency in the generated audio and the value specified in the text, averaged over the total number of events.
Timestamp: Measures the accuracy of timestamp control. Segment F1, a common metric in sound event detection tasks, is calculated from the detected and the specified onsets and offsets.
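The three scores above can be sketched in a few lines of Python. This illustrates the definitions only and is not the STEAMtool implementation: the function names, the 1-second segment resolution, the 10-second clip length, and the rule for deciding order (earliest detected onset) are all assumptions made for the example.

```python
import math


def is_order_correct(detected, first, second):
    """True if `first` starts before `second`; detected = {label: [(onset, offset), ...]}."""
    if not detected.get(first) or not detected.get(second):
        return False  # a missing event counts as an ordering error (assumption)
    return min(on for on, _ in detected[first]) < min(on for on, _ in detected[second])


def ordering_error_rate(results):
    """Fraction of clips whose detected order does not match the text."""
    return sum(not ok for ok in results) / len(results)


def mean_absolute_error(targets, predictions):
    """Average |specified - detected| over events (seconds for duration, counts for frequency)."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def active_segments(spans, n_segments, resolution):
    """Indices of fixed-length segments that overlap any (onset, offset) span."""
    active = set()
    for onset, offset in spans:
        start = math.floor(onset / resolution)
        end = math.ceil(offset / resolution)
        active.update(range(max(start, 0), min(end, n_segments)))
    return active


def segment_f1(reference, detected, clip_length=10.0, resolution=1.0):
    """Segment-based F1 over all event classes: 2TP / (2TP + FP + FN)."""
    n_segments = math.ceil(clip_length / resolution)
    tp = fp = fn = 0
    for label in set(reference) | set(detected):
        ref = active_segments(reference.get(label, []), n_segments, resolution)
        det = active_segments(detected.get(label, []), n_segments, resolution)
        tp += len(ref & det)
        fp += len(det - ref)
        fn += len(ref - det)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Specified timestamps come from the caption example; detected ones are made up.
reference = {"Explosion": [(0.947, 2.561)], "Breaking": [(4.368, 5.790)]}
detected = {"Explosion": [(1.0, 2.5)], "Breaking": [(4.0, 6.5)]}

print(is_order_correct(detected, "Explosion", "Breaking"))
print(ordering_error_rate([True, False, True, True]))
print(mean_absolute_error([4.33, 2.0], [4.0, 2.5]))
print(segment_f1(reference, detected))
```

In this toy case the detections cover four of the five reference segments and add one extra segment, so TP = 4, FP = 1, FN = 1 and the segment F1 is 0.8.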

Install

git clone https://github.com/zeyuxie29/AudioTime.git
cd AudioTime/STEAMtool
pip install -e .

Download the audio-text grounding (ATG) model checkpoint and place it in /STEAMtool/steam/grounding_tool/grounding_ckpt/.

Evaluation

  python steam/runner/steam_eval.py -p {generated_path} -t {task_name}

Here, {generated_path} is the path to the generated audio files; the audio file names must match the names in the caption files under /STEAMtool/data. {task_name} selects the task type and is one of "timestamp", "duration", "frequency", or "ordering" (default).
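To score all four tasks in one go, the command above can be wrapped in a small Python driver. This is a sketch under assumptions: it expects to be run from the STEAMtool directory, and the one-subfolder-per-task layout under `generated_root` is hypothetical, so adjust the paths to wherever your generated audio lives. Only the `-p` and `-t` options documented above are used.

```python
import subprocess
import sys

# The four task names accepted by -t, as listed above.
TASKS = ["ordering", "duration", "frequency", "timestamp"]


def build_command(generated_path, task_name):
    """Build the documented steam_eval.py invocation as an argument list."""
    return [
        sys.executable,
        "steam/runner/steam_eval.py",
        "-p", str(generated_path),
        "-t", task_name,
    ]


def run_all(generated_root, dry_run=True):
    """Run (or, with dry_run=True, only print) the evaluation for every task."""
    for task in TASKS:
        # Hypothetical layout: <generated_root>/<task>/ holds that task's audio.
        cmd = build_command(f"{generated_root}/{task}", task)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Print the commands without executing them.
run_all("generated", dry_run=True)
```

Calling `run_all("generated", dry_run=False)` executes the four evaluations in sequence and stops at the first one that exits with an error.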

Result

We evaluate several currently influential text-to-audio (TTA) generation models.
