Whisper model word-level timestamps broken #551

Open
1 of 5 tasks
BjoernRave opened this issue Jan 30, 2024 · 2 comments
Labels
bug Something isn't working

BjoernRave commented Jan 30, 2024

System Info

"@xenova/transformers": "^2.14.0",

MacBook with M2 chip and macOS Sonoma

Node.js: 20.11.0

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

I am running Whisper like this:

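import { pipeline, env } from "@xenova/transformers"

export const speechToText = async (audio: Buffer) => {
  // convertAudioToFloat32Array is the reporter's own helper (not shown here);
  // it should yield 16 kHz mono samples, which is what Whisper expects
  const float32Array = await convertAudioToFloat32Array(audio)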
  env.allowLocalModels = false

  const transcriber = await pipeline(
    "automatic-speech-recognition",
    "Xenova/whisper-large-v3",
  )
  const output = await transcriber(float32Array, {
    return_timestamps: "word",
  })

  return output
}
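The snippet calls convertAudioToFloat32Array without showing it. Whisper consumes 16 kHz mono audio, which the transformers.js pipeline accepts as a Float32Array. A minimal sketch of such a helper, assuming the Buffer already holds raw mono 16-bit little-endian PCM at 16 kHz (no WAV header, no resampling), might look like this; pcm16ToFloat32 is a hypothetical name, not the reporter's code:

```typescript
// Hypothetical stand-in for the reporter's convertAudioToFloat32Array helper.
// Assumes `pcm` holds raw mono 16-bit little-endian PCM already at 16 kHz
// (no WAV header); real audio files need decoding and resampling first.
function pcm16ToFloat32(pcm: Uint8Array): Float32Array {
  const view = new DataView(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  // A trailing odd byte (incomplete sample) is ignored.
  const sampleCount = Math.floor(pcm.byteLength / 2);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // Read a signed 16-bit sample and scale it into [-1, 1).
    out[i] = view.getInt16(i * 2, true) / 32768;
  }
  return out;
}
```

A Node Buffer can be passed in directly, since Buffer is a Uint8Array subclass. A WAV file would first need its header stripped and its audio resampled to 16 kHz by some other means.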

However, the returned word-level timestamps are all equal to the total duration of the audio file.
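To confirm this symptom programmatically rather than by eye, a small check can be run on the pipeline result. This is a sketch that assumes the transformers.js v2 word-level output shape, { text, chunks: [{ text, timestamp: [start, end] }] }; the function name timestampsCollapsedToDuration and the 0.05 s tolerance are hypothetical choices:

```typescript
// Shape of one word chunk, assuming the transformers.js v2 ASR output
// format: { text, chunks: [{ text, timestamp: [start, end] }] }.
// The end time may be null when the model never emitted one.
interface WordChunk {
  text: string;
  timestamp: [number, number | null];
}

// Returns true when every word's start and end sit at (approximately) the
// clip duration, i.e. the collapsed-timestamp symptom described above.
function timestampsCollapsedToDuration(
  chunks: WordChunk[],
  durationSec: number,
  toleranceSec = 0.05,
): boolean {
  if (chunks.length === 0) return false;
  return chunks.every(({ timestamp: [start, end] }) => {
    // Treat a missing end time as equal to the start time.
    const effectiveEnd = end ?? start;
    return (
      Math.abs(start - durationSec) <= toleranceSec &&
      Math.abs(effectiveEnd - durationSec) <= toleranceSec
    );
  });
}
```

For example, timestampsCollapsedToDuration(output.chunks, float32Array.length / 16000) would return true for the behaviour reported here, assuming 16 kHz input.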

During the run, my console is also flooded with log messages like these:

2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230677 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.20/self_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230684 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.12/self_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230693 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.6/final_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230701 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.11/encoder_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230708 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.4/encoder_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230716 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.1/self_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230741 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.0/final_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230761 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.10/encoder_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230773 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.8/self_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230782 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.3/final_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230797 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.6/self_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.231 node[87226:3900219] 2024-01-30 15:14:37.230806 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.2/self_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.231 node[87226:3900219] 2024-01-30 15:14:37.230814 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.0/self_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.231 node[87226:3900219] 2024-01-30 15:14:37.230844 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.17/self_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.231 node[87226:3900219] 2024-01-30 15:14:37.230855 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.19/encoder_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.231 node[87226:3900219] 2024-01-30 15:14:37.230865 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.0/encoder_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.231 node[87226:3900219] 2024-01-30 15:14:37.230873 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.5/final_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.

There is a related PR in the Python project: huggingface/transformers#25607

Reproduction

  1. Call Whisper with return_timestamps: "word"
  2. Inspect the output

BjoernRave added the "bug" label (Something isn't working) on Jan 30, 2024.
xenova (Owner) commented Feb 1, 2024

> Call whisper with return_timestamps: "word"
> Inspect output

Could you please provide a link to the audio file tested?

wobbble commented Apr 3, 2024

Hey @xenova,
Really big thanks for the awesome project.
I also have the wrong-timestamps issue.
From my tests, it looks like changing the stride parameter fixes it, but maybe it's a deeper issue.

Whisper Web with only
return_timestamps: "word",
[Screenshot 2024-04-03 at 10 26 14]

Whisper Web with word-level timestamps and a fixed value of stride_length_s = 3 at worker.js, line 160, instead of isDistilWhisper ? 3 : 5:
stride_length_s: 3, //isDistilWhisper ? 3 : 5,
[Screenshot 2024-04-03 at 10 19 20]
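The stride setting matters because long audio is transcribed in overlapping windows whose per-window timestamps are stitched together afterwards. The sketch below is modeled on the sliding-window chunking used by Hugging Face's ASR pipelines (each window is chunk_length_s long and the next one starts chunk_length_s - 2 × stride_length_s later) and shows how changing stride_length_s moves the window boundaries. chunkWindows is a hypothetical helper, and it does not by itself explain the bug:

```typescript
// One chunk window, measured in samples.
interface ChunkWindow {
  start: number;       // first sample of the window
  end: number;         // one past the last sample
  strideLeft: number;  // overlap shared with the previous window
  strideRight: number; // overlap shared with the next window
}

// Sketch of sliding-window chunking for long-form ASR: each window spans
// chunkLengthS seconds and the next starts chunkLengthS - 2 * strideLengthS
// seconds later, so neighbouring windows overlap by the stride on each side.
function chunkWindows(
  numSamples: number,
  samplingRate: number,
  chunkLengthS: number,
  strideLengthS: number,
): ChunkWindow[] {
  const windowSamples = Math.round(samplingRate * chunkLengthS);
  const strideSamples = Math.round(samplingRate * strideLengthS);
  const jumpSamples = windowSamples - 2 * strideSamples;
  if (jumpSamples <= 0) {
    throw new Error("stride_length_s must be less than chunk_length_s / 2");
  }

  const windows: ChunkWindow[] = [];
  for (let offset = 0; offset < numSamples; offset += jumpSamples) {
    const isFirst = offset === 0;
    const isLast = offset + jumpSamples >= numSamples;
    windows.push({
      start: offset,
      end: Math.min(offset + windowSamples, numSamples),
      // The first window has no left overlap, the last no right overlap.
      strideLeft: isFirst ? 0 : strideSamples,
      strideRight: isLast ? 0 : strideSamples,
    });
    if (isLast) break;
  }
  return windows;
}
```

Assuming 30 s windows (Whisper's native input length), a 70 s clip at 16 kHz yields four windows with stride_length_s = 5 but three with stride_length_s = 3, so the two settings stitch timestamps together at different points.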

CodeSandbox link with the changes that fix the timestamps

Attaching the audio file I tested with:
output.wav.zip

Thanks and have a great day!
