feat!: major performance & accuracy improvements in speech-to-text module#1132
Conversation
|
Also if this PR adds breaking change, please describe it directly below |
c5d3c14 to
a91344c
Compare
|
Side note, after merging PR with TTS and rebasing, please make sure that native tests works here after all changes. |
6191212 to
02113ff
Compare
chmjkb
left a comment
There was a problem hiding this comment.
tested the demo app and works like a charm for iOS, thank u!
| const WHISPER_TINY_EN_TOKENIZER = `${URL_PREFIX}-whisper-tiny.en/${VERSION_TAG}/tokenizer.json`; | ||
| const WHISPER_TINY_EN_MODEL_XNNPACK = `${URL_PREFIX}-whisper-tiny.en/${VERSION_TAG}/xnnpack/whisper_tiny_en_xnnpack_fp32.pte`; | ||
| const WHISPER_TINY_EN_MODEL_COREML = `${URL_PREFIX}-whisper-tiny.en/${VERSION_TAG}/coreml/whisper_tiny_en_coreml_fp32.pte`; |
There was a problem hiding this comment.
We used to handle the backend selection automatically, as done for the style transfer, not a big problem as this is likely going to be re-written in the mogel registry PR cc @msluszniak
There was a problem hiding this comment.
Yeah, I will handle it in my PR.
|
Native tests fail to configure on this rebased branch — Please drop the |
02113ff to
6bba141
Compare
There was a problem hiding this comment.
Tested green on Android (demo app + native tests). A few suggestions inline, plus a few that touch lines unchanged by this PR — listed here since the lines aren't part of the diff and can't be commented inline:
src/types/stt.ts:10,12,14—SpeechToTextModelNameunion still lists'whisper-tiny-en-quantized' | 'whisper-base-en-quantized' | 'whisper-small-en-quantized'. The quantized constants are deleted in this PR, so these literals now type-check but cannot be constructed from any built-in. Worth dropping in the same breaking-change.common/rnexecutorch/models/speech_to_text/whisper/ASR.cpp:217(preexisting) — divisortokens.size() + 1matches neither a literal mean (scores.size()) nor OpenAI Whisper's formula (len(full_seq) + 1, wherefull_seqincludes SOT prefix and EOT). Worth picking one explicitly. For reference, whisper.cpp usessum_logprobs / result_len(no+1) — src/whisper.cpp:6602-6603.common/rnexecutorch/models/speech_to_text/whisper/ASR.cpp:308(preexisting) —std::mt19937 gen((std::random_device{}()))lives inside the autoregressive sampling loop, sorandom_deviceis consulted and a fresh Mersenne state is constructed for every sampled token. Hoist to a member (orstatic thread_local) seeded once pergenerate().common/rnexecutorch/models/speech_to_text/SpeechToText.h:38(preexisting) —transcribeStringOnlyis declared but never defined or referenced anywhere in the package; dead API surface, safe to drop.
Non-blocking — feel free to fold what you want into this PR or a follow-up.
The method was declared in SpeechToText.h but never defined or referenced anywhere in the package. Removing it cleans up the public API surface.
insertAudioChunk's overflow path was overwriting memory_.toCommit on each cap-hit. Two cap-hits before the next process() call silently dropped the first batch. Append instead of assign.
The previous tokens.size() + 1 matched neither a literal mean (would be scores.size()) nor OpenAI Whisper's formula (len(full_seq) + 1, where full_seq includes the SOT prefix and EOT). Align with whisper.cpp, which divides by the number of summed log-probs.
random_device was consulted and a fresh Mersenne state constructed for every sampled token. Seed once per generate() call instead.
The whisper-*-en-quantized constants are removed in this PR, but the SpeechToTextModelName union still accepted those literals — type-safe to pass, runtime-failing to use. Drop them from the union as part of the same breaking-change.
The header had bool enableTimestamps; the .cpp uses bool verbose (which matches the JS-side DecodingOptions.verbose). Rename here for consistency.
The streaming loop slept sleep_for(timeout) ms unconditionally between inferences, so streamStop() couldn't take effect until the next pause expired (final flush delayed by the full timeout). Replace with a condition_variable wait that streamStop() signals; inserts intentionally do not wake the loop, preserving the throttle.
msluszniak
left a comment
There was a problem hiding this comment.
Looks good, added some minor improvements and tested all on android. If you want, you can retest on demo app on iOS, up to you. Overall great job on this one :))
|
I tested demos on iOS before those changes and it worked good, guess ill retest tomorrow |
chmjkb
left a comment
There was a problem hiding this comment.
tested demo apps on iOS, works fine
…ware-mansion/react-native-executorch into @is/speech-to-text-ultimate
Description
This PR introduces several changes to the speech-to-text module based on Whisper models:
Introduces a breaking change?
Change: removes predefined constants for quantized models.
Justification: the quantized models differ very slightly from the original ones, introducing unnecessary complexity in this case.
Type of change
Tested on
Testing instructions
Run demo app to test the live streaming mode.
Screenshots
Related issues
#1124
Checklist
Additional notes