AudioTime


We present AudioTime, a strongly aligned audio-text dataset. It provides text annotations rich in temporal information such as timestamps, duration, frequency, and ordering, covering almost all aspects of temporal control. We also provide a comprehensive test set and an evaluation metric, STEAM, to assess the temporal control performance of various models.

Dataset

Audio samples can be found in the AudioTime-Demo. There are four types of alignment signals:

  1. Ordering: "A yip occurs, followed by a bleat after a short pause."
  2. Duration: "A water tap or faucet ran for 4.33 seconds."
  3. Frequency: "Sanding occurs once, followed by throat clearing twice."
  4. Timestamp: "An explosion occurs from 0.947 to 2.561 seconds, and then breaking sounds are heard from 4.368 to 5.790 seconds."

You can download the data from Google Drive: AudioTime (train) and AudioTime (test), or from BaiduNetDisk with the extraction code "time". The directory structure is:

AudioTime/
├── train/
│   ├── train5000_ordering/
│   │   ├── audio/
│   │   │   ├── syn_1.wav
│   │   │   ├── syn_2.wav
│   │   │   ├── ...
│   │   │   └── syn_5000.wav
│   │   └── ordering_captions.json
│   ├── ...   
│   └── train5000_timestamp/
└── test/
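Given this layout, pairing the numbered wav files with a split's captions file is mechanical. A minimal sketch, where the audio_paths helper and its argument values are illustrative, not part of the repository:

```python
from pathlib import Path

def audio_paths(root, task, n):
    """Hypothetical helper: build the expected wav paths for a training
    task split, following the directory layout shown above."""
    base = Path(root) / "train" / f"train5000_{task}" / "audio"
    return [base / f"syn_{i}.wav" for i in range(1, n + 1)]

# First three audio files of the ordering split.
paths = audio_paths("AudioTime", "ordering", 3)
```

The corresponding captions file sits next to the audio/ directory, e.g. ordering_captions.json for the ordering split.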

The JSON files contain annotations, including audio_id, metadata, and GPT-generated captions. An example is shown below:

"syn_1": {
        "event": {
            "Electric shaver, electric razor": [
                [
                    1.056,
                    5.158
                ]
            ],
            "Jackhammer": [
                [
                    7.66,
                    10.0
                ]
            ]
        },
        "caption": "An electric shaver buzzes from 1.056 to 5.158 seconds, followed by a jackhammer pounding from 7.66 to 10 seconds."
    },
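Entries in this format can be consumed with the standard json module. A short sketch using the sample entry above; the event_durations helper is illustrative, not part of the repository:

```python
import json

# The sample annotation entry shown above, as one record of a captions JSON.
sample = json.loads("""
{
  "syn_1": {
    "event": {
      "Electric shaver, electric razor": [[1.056, 5.158]],
      "Jackhammer": [[7.66, 10.0]]
    },
    "caption": "An electric shaver buzzes from 1.056 to 5.158 seconds, followed by a jackhammer pounding from 7.66 to 10 seconds."
  }
}
""")

def event_durations(annotation):
    """Total duration in seconds of each labelled event,
    summed over its (onset, offset) spans."""
    return {
        label: sum(off - on for on, off in spans)
        for label, spans in annotation["event"].items()
    }

durations = event_durations(sample["syn_1"])
```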

The dataset statistics are shown in the figure below.

STEAM: Strongly TEmporally-Aligned evaluation Metric

STEAM is a text-based metric that evaluates whether the generated audio segments meet the control requirements specified by the input text. STEAM assesses control performance based on detected timestamps and the control signal provided by the input free text. The testing script is available at STEAMtool.

  • Ordering: Checks whether the generated audio produces events A and B in the specified order, quantified by the error rate.
  • Duration / frequency: Computes the absolute error between the event duration/frequency in the generated audio and the value specified in the text, averaged over the total number of events.
  • Timestamp: Measures the accuracy of timestamp control. Segment F1, a common metric in sound event detection, is calculated from the detected and specified onsets and offsets.
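As a rough illustration of the timestamp metric: segment-based F1 discretizes the timeline into fixed-length segments and compares per-segment activity between the reference and the detected events. The sketch below is a simplified single-event version of that idea, not the official STEAMtool implementation:

```python
import math

def segment_f1(ref_spans, est_spans, total_dur, seg=1.0):
    """Simplified segment-based F1 for one event class: split the
    timeline into fixed-length segments and compare reference vs.
    estimated activity segment by segment. (Sketch only; the official
    metric lives in STEAMtool.)"""
    n = math.ceil(total_dur / seg)

    def active(spans):
        # A segment is active if any (onset, offset) span overlaps it.
        return [any(on < (i + 1) * seg and off > i * seg for on, off in spans)
                for i in range(n)]

    ref, est = active(ref_spans), active(est_spans)
    tp = sum(r and e for r, e in zip(ref, est))
    fp = sum(e and not r for r, e in zip(ref, est))
    fn = sum(r and not e for r, e in zip(ref, est))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

A perfect match scores 1.0; a one-second shift of a three-second event against a one-second grid costs one false positive and one false negative segment.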

Install

git clone https://github.com/zeyuxie29/AudioTime.git
cd AudioTime/STEAMtool
pip install -e .

Download the audio-text grounding (ATG) model checkpoint and place it at /STEAMtool/steam/grounding_tool/grounding_ckpt/.

Evaluation

  python steam/runner/steam_eval.py -p {generated_path} -t {task_name}

Here {generated_path} is the path to the generated audio files; the audio file names must match the names in the caption files under /STEAMtool/data. {task_name} specifies the task type, one of "timestamp", "duration", "frequency", or "ordering" (default).

Result

We evaluate several influential text-to-audio (TTA) generation models; results are shown in the figure below.
