feat!: major performance & accuracy improvements in speech-to-text module #1132

Merged
IgorSwat merged 21 commits into main from @is/speech-to-text-ultimate on May 21, 2026

Contributor

@IgorSwat IgorSwat commented May 8, 2026

Description

This PR introduces several changes to the speech-to-text module based on Whisper models:

  • CoreML integration - models re-exported to the CoreML backend, bringing a significant performance improvement on iOS devices.
  • New streaming algorithm - eliminates duplicates in the streaming output, which substantially improves the quality of the live streaming mode.
  • Changes in demo apps - removed the faulty 'voice mode' screen from the LLM demo app; refactored the speech-to-text screen in the 'speech' app by adding the new CoreML models to the selection bar and changing the default model for iOS devices.
  • Minor code improvements in the speech-to-text module.

Introduces a breaking change?

  • Yes
  • No

Change: removes the predefined constants for quantized models.
Justification: the quantized models differ only slightly from the original ones, so keeping them adds unnecessary complexity.

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

Run the demo app to test the live streaming mode.

Screenshots

Related issues

#1124

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

@IgorSwat IgorSwat requested review from benITo47, chmjkb and msluszniak May 8, 2026 08:26
@IgorSwat IgorSwat added the model (Issues related to exporting, improving, fixing ML models) and improvement (PRs or issues focused on improvements in the current codebase) labels May 8, 2026
Comment thread apps/speech/screens/SpeechToTextScreen.tsx Outdated
Comment thread apps/speech/package.json Outdated
Comment thread packages/react-native-executorch/src/constants/modelUrls.ts
@IgorSwat IgorSwat changed the title from "feat: major performance & accuracy improvements in speech-to-text module" to "feat!: major performance & accuracy improvements in speech-to-text module" May 8, 2026
@msluszniak
Member

Also, if this PR adds a breaking change, please describe it directly below the "Introduces a breaking change?" section in the PR body.

@IgorSwat IgorSwat force-pushed the @is/speech-to-text-ultimate branch from c5d3c14 to a91344c May 19, 2026 11:17
@msluszniak
Member

Side note: after merging the PR with TTS and rebasing, please make sure that the native tests still work here after all the changes.

Collaborator

@chmjkb chmjkb left a comment

approved by accident

@IgorSwat IgorSwat force-pushed the @is/speech-to-text-ultimate branch from 6191212 to 02113ff May 20, 2026 12:30
Collaborator

@chmjkb chmjkb left a comment

tested the demo app and works like a charm for iOS, thank u!

Comment on lines +776 to +778
const WHISPER_TINY_EN_TOKENIZER = `${URL_PREFIX}-whisper-tiny.en/${VERSION_TAG}/tokenizer.json`;
const WHISPER_TINY_EN_MODEL_XNNPACK = `${URL_PREFIX}-whisper-tiny.en/${VERSION_TAG}/xnnpack/whisper_tiny_en_xnnpack_fp32.pte`;
const WHISPER_TINY_EN_MODEL_COREML = `${URL_PREFIX}-whisper-tiny.en/${VERSION_TAG}/coreml/whisper_tiny_en_coreml_fp32.pte`;
Collaborator

We used to handle backend selection automatically, as is done for style transfer. Not a big problem, as this is likely going to be rewritten in the model registry PR. cc @msluszniak

Member

Yeah, I will handle it in my PR.

@msluszniak
Member

Native tests fail to configure on this rebased branch — tests/CMakeLists.txt:265 still references models/speech_to_text/whisper/HypothesisBuffer.cpp, which this PR deletes:

CMake Error at CMakeLists.txt:137 (add_executable):
  Cannot find source file:

    .../models/speech_to_text/whisper/HypothesisBuffer.cpp

Call Stack (most recent call first):
  CMakeLists.txt:261 (add_rn_test)

Please drop the HypothesisBuffer.cpp source line (and any HypothesisBuffer.h includes in the speech-to-text test) so bash run_tests.sh builds again.

@IgorSwat IgorSwat force-pushed the @is/speech-to-text-ultimate branch from 02113ff to 6bba141 May 20, 2026 15:46
Member

@msluszniak msluszniak left a comment

Tested green on Android (demo app + native tests). A few suggestions inline, plus a few that touch lines unchanged by this PR — listed here since the lines aren't part of the diff and can't be commented inline:

  • src/types/stt.ts:10,12,14 - SpeechToTextModelName union still lists 'whisper-tiny-en-quantized' | 'whisper-base-en-quantized' | 'whisper-small-en-quantized'. The quantized constants are deleted in this PR, so these literals still type-check but no built-in constant produces them. Worth dropping as part of the same breaking change.
  • common/rnexecutorch/models/speech_to_text/whisper/ASR.cpp:217 (preexisting) — divisor tokens.size() + 1 matches neither a literal mean (scores.size()) nor OpenAI Whisper's formula (len(full_seq) + 1, where full_seq includes SOT prefix and EOT). Worth picking one explicitly. For reference, whisper.cpp uses sum_logprobs / result_len (no +1) — src/whisper.cpp:6602-6603.
  • common/rnexecutorch/models/speech_to_text/whisper/ASR.cpp:308 (preexisting) — std::mt19937 gen((std::random_device{}())) lives inside the autoregressive sampling loop, so random_device is consulted and a fresh Mersenne state is constructed for every sampled token. Hoist to a member (or static thread_local) seeded once per generate().
  • common/rnexecutorch/models/speech_to_text/SpeechToText.h:38 (preexisting) — transcribeStringOnly is declared but never defined or referenced anywhere in the package; dead API surface, safe to drop.

Non-blocking — feel free to fold what you want into this PR or a follow-up.
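
To make the divisor question concrete, here is a minimal C++ sketch contrasting the previous `tokens.size() + 1` divisor with a literal mean over the summed log-probabilities. The function names and signatures are hypothetical and are not taken from ASR.cpp; only the two divisors come from the discussion above.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Literal mean: divide the summed log-probabilities by the number of
// summed terms (hypothetical helper, not the real ASR.cpp code).
inline float meanLogProb(const std::vector<float>& scores) {
  assert(!scores.empty() && "need at least one score");
  const float sum = std::accumulate(scores.begin(), scores.end(), 0.0f);
  return sum / static_cast<float>(scores.size());
}

// The pre-existing variant as described in the review: divide the same
// sum by (token count + 1).
inline float plusOneLogProb(const std::vector<float>& scores,
                            std::size_t tokenCount) {
  const float sum = std::accumulate(scores.begin(), scores.end(), 0.0f);
  return sum / static_cast<float>(tokenCount + 1);
}
```

For three scores that sum to -3.0, the literal mean is -1.0 while the `+ 1` variant gives -0.75, so the extra term pulls the average toward zero, and the effect is largest for short sequences.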

Comment thread packages/react-native-executorch/src/constants/modelUrls.ts
Comment thread packages/react-native-executorch/src/constants/modelUrls.ts
Mateusz Słuszniak added 7 commits May 20, 2026 19:22
The method was declared in SpeechToText.h but never defined or referenced
anywhere in the package. Removing it cleans up the public API surface.

insertAudioChunk's overflow path was overwriting memory_.toCommit on
each cap-hit. Two cap-hits before the next process() call silently
dropped the first batch. Append instead of assign.
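
The append-versus-assign fix can be illustrated with a small C++ sketch. The `StreamMemory` struct, the `insertChunk` function and the `cap` parameter are hypothetical stand-ins; only the `toCommit` field name comes from the commit message, and the real overflow logic may differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the streaming state.
struct StreamMemory {
  std::vector<float> buffer;    // audio held in the bounded window
  std::vector<float> toCommit;  // overflowed audio awaiting the next process()
};

// Adds a chunk to the window. Samples that exceed `cap` are moved from the
// front of the window into toCommit.
inline void insertChunk(StreamMemory& m, const std::vector<float>& chunk,
                        std::size_t cap) {
  m.buffer.insert(m.buffer.end(), chunk.begin(), chunk.end());
  if (m.buffer.size() <= cap) {
    return;
  }
  const std::size_t overflow = m.buffer.size() - cap;
  const auto first = m.buffer.begin();
  const auto last = first + static_cast<std::ptrdiff_t>(overflow);
  // Append rather than assign: an earlier overflow batch may still be pending.
  m.toCommit.insert(m.toCommit.end(), first, last);
  m.buffer.erase(first, last);
}
```

With assignment in place of the append, a second cap-hit before `toCommit` is drained would replace the first batch instead of extending it.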

The previous tokens.size() + 1 matched neither a literal mean (would be
scores.size()) nor OpenAI Whisper's formula (len(full_seq) + 1, where
full_seq includes the SOT prefix and EOT). Align with whisper.cpp,
which divides by the number of summed log-probs.

random_device was consulted and a fresh Mersenne state constructed for
every sampled token. Seed once per generate() call instead.
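
A minimal C++ sketch of the "seed once" pattern follows. The `Sampler` class and its `sample` method are hypothetical and only illustrate hoisting the generator out of the per-token loop; the actual member layout in ASR.cpp is not shown in this thread.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sampler that owns its generator. One instance is meant to be
// created per generate() call, so seeding happens once rather than per token.
class Sampler {
 public:
  // Non-deterministic seed, consulted a single time.
  Sampler() : gen_(std::random_device{}()) {}
  // Fixed seed, useful for reproducible tests.
  explicit Sampler(unsigned int seed) : gen_(seed) {}

  // Draws an index with probability proportional to the given
  // non-negative weights, reusing the already-seeded generator.
  std::size_t sample(const std::vector<double>& weights) {
    std::discrete_distribution<std::size_t> dist(weights.begin(),
                                                 weights.end());
    return dist(gen_);
  }

 private:
  std::mt19937 gen_;
};
```

Besides avoiding the repeated `random_device` calls and generator construction, keeping a single generator also allows a fixed seed to reproduce a whole sampled sequence.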

The whisper-*-en-quantized constants are removed in this PR, but the
SpeechToTextModelName union still accepted those literals — type-safe
to pass, runtime-failing to use. Drop them from the union as part of
the same breaking-change.

The header had bool enableTimestamps; the .cpp uses bool verbose (which
matches the JS-side DecodingOptions.verbose). Rename here for
consistency.

The streaming loop slept sleep_for(timeout) ms unconditionally between
inferences, so streamStop() couldn't take effect until the next pause
expired (final flush delayed by the full timeout). Replace with a
condition_variable wait that streamStop() signals; inserts intentionally
do not wake the loop, preserving the throttle.
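
The interruptible pause described in this commit can be sketched in C++ as follows. The `StreamThrottle` class and its method names are hypothetical; the sketch only assumes the behaviour stated above, namely that a stop request wakes the loop while inserts do not.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper replacing an unconditional sleep_for(timeout).
class StreamThrottle {
 public:
  // Blocks for up to `timeout`. Returns true if stop() was requested,
  // false if the full pause elapsed.
  bool waitOrStop(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout, [this] { return stopped_; });
  }

  // Requests a stop and wakes a pending waitOrStop() immediately.
  void stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopped_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stopped_ = false;
};
```

A streaming loop would call `waitOrStop(timeout)` between inferences and leave for the final flush as soon as it returns true. Because nothing on the insert path notifies the condition variable, the throttle between inferences is preserved.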
Member

@msluszniak msluszniak left a comment

Looks good, added some minor improvements and tested everything on Android. If you want, you can retest the demo app on iOS, up to you. Overall great job on this one :))

Collaborator

chmjkb commented May 20, 2026

I tested the demos on iOS before those changes and they worked well; I guess I'll retest tomorrow.

Collaborator

@chmjkb chmjkb left a comment

tested demo apps on iOS, works fine

@IgorSwat IgorSwat merged commit d3182ce into main May 21, 2026
5 checks passed
@IgorSwat IgorSwat deleted the @is/speech-to-text-ultimate branch May 21, 2026 08:20
Labels

improvement: PRs or issues focused on improvements in the current codebase
model: Issues related to exporting, improving, fixing ML models
