
Support for Both Word-Level and Sentence-Level Timestamps in ASR Decoding #855

MatteoFasulo opened this issue Jul 20, 2024 · 0 comments
Labels: enhancement (New feature or request)


@MatteoFasulo

Feature request

Hello,

I am currently using the `_decode_asr` function in your ASR decoding code for Whisper. This function can return either word-level or sentence-level (chunk-level) timestamps, but not both at the same time.

Motivation

In my use case, I need both word-level and sentence-level timestamps in the output. Word-level timestamps are useful for aligning individual words with the audio, while sentence-level timestamps are useful for aligning larger chunks of text.

Currently, I can get word-level timestamps by setting `return_timestamps="word"`, or sentence-level timestamps by setting `return_timestamps=True`. However, there is no option to get both types of timestamps in the same output.
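For reference, a minimal sketch of the two mutually exclusive modes, using the standard transformers ASR pipeline (the checkpoint name and audio path here are placeholders):

```python
from transformers import pipeline

# Placeholder checkpoint and audio file; any Whisper model behaves the same way.
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")

# Sentence/chunk-level timestamps: one (start, end) pair per segment.
sentences = pipe("audio.wav", return_timestamps=True)["chunks"]

# Word-level timestamps: one (start, end) pair per word.
words = pipe("audio.wav", return_timestamps="word")["chunks"]

# There is currently no value of return_timestamps that yields both at once.
```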

Your contribution

I suggest adjusting the `_decode_asr` function to provide an option for returning both word-level and sentence-level timestamps. This could potentially be achieved by reconstructing each sentence from the word-level timestamps, since a sentence is essentially a collection of words. However, the model's output differs between the two modes: when word-level timestamps are requested, no timestamp tokens are predicted at all.

For instance, with `return_timestamps=True`, the predicted tokens are:

    50257,50363,921,1053,1392,674,3241,13,1867,318,340,345,765,30,50453,50453,314,1183,16908,11,475,691,611,345,9149,42879,338,4925,13,50578,50703,921,1309,606,25188,1648,345,30,50773,50773,43048,470,307,881,329,16908,611,314,26643,13,50863,50256

For the same audio file, with `return_timestamps="word"`, the predicted tokens are:

    50257,50362,921,1053,1392,674,3241,13,1867,318,340,345,765,30,314,1183,16908,11,475,691,611,345,9149,42879,338,4925,13,921,1309,606,25188,1648,345,30,43048,470,307,881,329,16908,611,314,26643,13,50256
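To make the difference visible, the raw IDs can be decoded with the Whisper tokenizer. A sketch, assuming an English-only checkpoint, where `<|notimestamps|>` is 50362 (exactly what the second sequence starts with) and timestamp tokens begin at 50363 (`<|0.00|>`):

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en")

with_ts   = [50257, 50363, 921, 1053, 1392, 674, 3241, 13, 50453, 50453, 50256]  # shortened
word_mode = [50257, 50362, 921, 1053, 1392, 674, 3241, 13, 50256]                # shortened

# decode_with_timestamps renders timestamp tokens as <|1.80|> instead of dropping them.
print(tokenizer.decode(with_ts, decode_with_timestamps=True))

# The word-mode sequence contains no ID >= 50363, so the chunk-level logic
# in _decode_asr has no timestamp tokens to anchor on.
print(any(i >= 50363 for i in word_mode))  # False
```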

As you can see, the second sequence contains no timestamp tokens (e.g., 50453), which, according to the tokenizer file at the provided URL, should represent the following token:

    {
      "id": 50453,
      "content": "<|1.80|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    }

This token encodes that sentence's timestamp, 1.80 s, measured from the start of the segment that begins at the BOS token.
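The mapping from token ID to seconds is linear. A small sketch of the conversion, assuming the English-only vocabulary layout above (`<|0.00|>` = 50363, 0.02 s per step):

```python
TIMESTAMP_BEGIN = 50363  # assumed ID of <|0.00|> in English-only Whisper vocabularies
TIME_PRECISION = 0.02    # seconds per timestamp-token step

def timestamp_token_to_seconds(token_id: int) -> float:
    """Convert a Whisper timestamp token ID to seconds from the segment start."""
    return (token_id - TIMESTAMP_BEGIN) * TIME_PRECISION

print(timestamp_token_to_seconds(50453))  # 1.8, matching <|1.80|> above
```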

So my question is: is there a method to extract such sentence-level timestamps from the word-level output? It seems that generating word-level timestamps omits the sentence-level information entirely, thereby preventing the retrieval of sentence-level timestamps.
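Until such an option exists, one possible workaround is to request word-level timestamps and rebuild approximate sentence-level ones in post-processing. A rough sketch, with the caveat that splitting on sentence-final punctuation is only a heuristic:

```python
def words_to_sentences(word_chunks):
    """Group word-level chunks (as returned by the pipeline with
    return_timestamps="word") into approximate sentence-level chunks.

    Each input item looks like {"text": " word", "timestamp": (start, end)}.
    """
    sentences, current = [], []
    for word in word_chunks:
        current.append(word)
        # Heuristic: a word ending in ., ? or ! closes the sentence.
        if word["text"].rstrip().endswith((".", "?", "!")):
            sentences.append({
                "text": "".join(w["text"] for w in current),
                "timestamp": (current[0]["timestamp"][0], current[-1]["timestamp"][1]),
            })
            current = []
    if current:  # trailing words with no closing punctuation
        sentences.append({
            "text": "".join(w["text"] for w in current),
            "timestamp": (current[0]["timestamp"][0], current[-1]["timestamp"][1]),
        })
    return sentences
```

This yields both granularities from a single word-level pass, though the resulting boundaries are word-aligned rather than the model's own predicted `<|...|>` segment ends, so they will not match those tokens exactly.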

@MatteoFasulo added the enhancement (New feature or request) label on Jul 20, 2024