Feature request

Hello,

I am currently using the `_decode_asr` function in your ASR decoding library (Whisper). This function provides an option to return either word-level or sentence-level (chunk-level) timestamps, but not both at the same time.

Motivation
In my use case, I need to have both word-level and sentence-level timestamps in the output. Word-level timestamps are useful for aligning individual words with the audio, while sentence-level timestamps are useful for aligning larger chunks of text.
Currently, I am able to get either word-level timestamps by setting `return_timestamps` to `"word"`, or sentence-level timestamps by setting `return_timestamps` to `True`. However, there is no option to get both types of timestamps in the same output.
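For context, this is roughly how the two modes are invoked through the `automatic-speech-recognition` pipeline (the checkpoint name and audio path below are placeholders):

```python
from transformers import pipeline

# Placeholder checkpoint and audio path; any Whisper checkpoint behaves the same way.
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Chunk-level (sentence-level) timestamps: each chunk spans a whole segment.
chunk_output = pipe("audio.mp3", return_timestamps=True)
# -> {"text": ..., "chunks": [{"text": "...", "timestamp": (start, end)}, ...]}

# Word-level timestamps: each chunk is a single word.
word_output = pipe("audio.mp3", return_timestamps="word")
# -> {"text": ..., "chunks": [{"text": " word", "timestamp": (start, end)}, ...]}
```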
Your contribution

I suggest we adjust the `_decode_asr` function to provide an option for returning both word-level and sentence-level timestamps. This could potentially be achieved by reconstructing each sentence from the word-level timestamps, since a sentence is essentially a collection of words (a rough sketch of this idea is included at the end of this issue). However, it appears that the model's output differs when set to word-level: I've noticed that no timestamp tokens are predicted.

For instance:
With `return_timestamps=True`, the predicted tokens are:

[token sequence omitted in the original]

For the same audio file, with `return_timestamps="word"`, the predicted tokens are:

[token sequence omitted in the original]

As you can see, the word-level output contains none of the timestamp tokens (e.g., 50453) that appear in the chunk-level output. According to the tokenizer at the provided URL, id 50453 corresponds to a timestamp token, which represents the timestamp of that sentence relative to the BOS token.
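As a side note, mapping a timestamp token id back to seconds is straightforward. The constants below are assumptions on my side (they match the multilingual Whisper vocabulary, where `<|0.00|>` has id 50364 and timestamps advance in 0.02 s steps), so please verify them against the actual tokenizer:

```python
# Sketch: convert a Whisper timestamp token id to seconds.
# ASSUMPTION: multilingual Whisper vocabulary where <|0.00|> has id 50364
# and consecutive timestamp tokens step by 0.02 s. Check these constants
# against your checkpoint's tokenizer before relying on them.
TIMESTAMP_BEGIN = 50364  # id of <|0.00|> (assumed)
TIME_PRECISION = 0.02    # seconds per timestamp step

def timestamp_token_to_seconds(token_id: int) -> float:
    if token_id < TIMESTAMP_BEGIN:
        raise ValueError(f"{token_id} is not a timestamp token")
    return round((token_id - TIMESTAMP_BEGIN) * TIME_PRECISION, 2)

print(timestamp_token_to_seconds(50453))  # 1.78 under these assumptions
```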
So, my question is: is there a way to extract such sentence-level timestamps from the word-level output? It seems that generating word-level timestamps omits the sentence-level information, which prevents recovering it afterwards.
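Coming back to the grouping idea from "Your contribution": below is a rough sketch that reassembles sentence-level spans from word-level chunks by splitting on sentence-final punctuation. It assumes the pipeline's word-level output format shown above; the punctuation heuristic is an assumption of mine and will not reproduce the model's own segment boundaries:

```python
# Sketch: derive sentence-level timestamps from word-level chunks.
# Assumes chunks shaped like {"text": " word", "timestamp": (start, end)},
# as returned by the ASR pipeline with return_timestamps="word".
def words_to_sentences(word_chunks):
    sentences, current = [], []
    for chunk in word_chunks:
        current.append(chunk)
        # Heuristic sentence boundary: word ends with terminal punctuation.
        if chunk["text"].rstrip().endswith((".", "!", "?")):
            sentences.append(_merge(current))
            current = []
    if current:  # trailing words without closing punctuation
        sentences.append(_merge(current))
    return sentences

def _merge(chunks):
    return {
        "text": "".join(c["text"] for c in chunks).strip(),
        "timestamp": (chunks[0]["timestamp"][0], chunks[-1]["timestamp"][1]),
    }
```

This works as a stopgap, but because it keys on punctuation rather than the model's predicted timestamp tokens, a native option in `_decode_asr` that returns both granularities would still be preferable.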