
Support for Both Word-Level and Sentence-Level Timestamps in ASR Decoding #855

MatteoFasulo opened this issue Jul 20, 2024 · 0 comments
Labels: enhancement (New feature or request)


@MatteoFasulo

Feature request

Hello,

I am currently using the `_decode_asr` function in your ASR decoding code for Whisper. This function can return either word-level or sentence-level (chunk-level) timestamps, but not both at the same time.

Motivation

In my use case, I need both word-level and sentence-level timestamps in the output. Word-level timestamps are useful for aligning individual words with the audio, while sentence-level timestamps are useful for aligning larger chunks of text.

Currently, I can get word-level timestamps by setting `return_timestamps="word"`, or sentence-level timestamps by setting `return_timestamps=True`. However, there is no option to get both types of timestamps in the same output.
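For reference, a minimal sketch of the two mutually exclusive modes, using the standard transformers ASR pipeline (the checkpoint name and audio path here are placeholders):

```python
from transformers import pipeline

# Placeholder checkpoint and audio file; any Whisper model behaves the same way.
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")

# Sentence/chunk-level timestamps: one (start, end) pair per segment.
sentences = pipe("audio.wav", return_timestamps=True)["chunks"]

# Word-level timestamps: one (start, end) pair per word.
words = pipe("audio.wav", return_timestamps="word")["chunks"]

# There is currently no value of return_timestamps that yields both at once.
```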

Your contribution

I suggest adjusting the `_decode_asr` function to provide an option for returning both word-level and sentence-level timestamps. This could potentially be achieved by reconstructing each sentence from the word-level timestamps, since a sentence is essentially a collection of words. However, the model's output differs between the two modes: when word-level timestamps are requested, no timestamp tokens are predicted at all.

For instance, with `return_timestamps=True`, the predicted tokens are:

    50257,50363,921,1053,1392,674,3241,13,1867,318,340,345,765,30,50453,50453,314,1183,16908,11,475,691,611,345,9149,42879,338,4925,13,50578,50703,921,1309,606,25188,1648,345,30,50773,50773,43048,470,307,881,329,16908,611,314,26643,13,50863,50256

For the same audio file, with `return_timestamps="word"`, the predicted tokens are:

    50257,50362,921,1053,1392,674,3241,13,1867,318,340,345,765,30,314,1183,16908,11,475,691,611,345,9149,42879,338,4925,13,921,1309,606,25188,1648,345,30,43048,470,307,881,329,16908,611,314,26643,13,50256
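To make the difference visible, the raw IDs can be decoded with the Whisper tokenizer. A sketch, assuming an English-only checkpoint, where `<|notimestamps|>` is 50362 (exactly what the second sequence starts with) and timestamp tokens begin at 50363 (`<|0.00|>`):

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en")

with_ts   = [50257, 50363, 921, 1053, 1392, 674, 3241, 13, 50453, 50453, 50256]  # shortened
word_mode = [50257, 50362, 921, 1053, 1392, 674, 3241, 13, 50256]                # shortened

# decode_with_timestamps renders timestamp tokens as <|1.80|> instead of dropping them.
print(tokenizer.decode(with_ts, decode_with_timestamps=True))

# The word-mode sequence contains no ID >= 50363, so the chunk-level logic
# in _decode_asr has no timestamp tokens to anchor on.
print(any(i >= 50363 for i in word_mode))  # False
```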

As you can see, the second sequence contains no timestamp tokens (e.g., 50453), which, according to the tokenizer file at the provided URL, should represent the following token:

    {
      "id": 50453,
      "content": "<|1.80|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    }

This token encodes that sentence's timestamp, 1.80 s, measured from the start of the segment that begins at the BOS token.
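The mapping from token ID to seconds is linear. A small sketch of the conversion, assuming the English-only vocabulary layout above (`<|0.00|>` = 50363, 0.02 s per step):

```python
TIMESTAMP_BEGIN = 50363  # assumed ID of <|0.00|> in English-only Whisper vocabularies
TIME_PRECISION = 0.02    # seconds per timestamp-token step

def timestamp_token_to_seconds(token_id: int) -> float:
    """Convert a Whisper timestamp token ID to seconds from the segment start."""
    return (token_id - TIMESTAMP_BEGIN) * TIME_PRECISION

print(timestamp_token_to_seconds(50453))  # 1.8, matching <|1.80|> above
```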

So my question is: is there a method to extract such sentence-level timestamps from the word-level output? It seems that generating word-level timestamps omits the sentence-level information entirely, thereby preventing the retrieval of sentence-level timestamps.
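Until such an option exists, one possible workaround is to request word-level timestamps and rebuild approximate sentence-level ones in post-processing. A rough sketch, with the caveat that splitting on sentence-final punctuation is only a heuristic:

```python
def words_to_sentences(word_chunks):
    """Group word-level chunks (as returned by the pipeline with
    return_timestamps="word") into approximate sentence-level chunks.

    Each input item looks like {"text": " word", "timestamp": (start, end)}.
    """
    sentences, current = [], []
    for word in word_chunks:
        current.append(word)
        # Heuristic: a word ending in ., ? or ! closes the sentence.
        if word["text"].rstrip().endswith((".", "?", "!")):
            sentences.append({
                "text": "".join(w["text"] for w in current),
                "timestamp": (current[0]["timestamp"][0], current[-1]["timestamp"][1]),
            })
            current = []
    if current:  # trailing words with no closing punctuation
        sentences.append({
            "text": "".join(w["text"] for w in current),
            "timestamp": (current[0]["timestamp"][0], current[-1]["timestamp"][1]),
        })
    return sentences
```

This yields both granularities from a single word-level pass, though the resulting boundaries are word-aligned rather than the model's own predicted `<|...|>` segment ends, so they will not match those tokens exactly.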

@MatteoFasulo added the enhancement (New feature or request) label on Jul 20, 2024