
Conversation

@msluszniak (Member) commented Jan 21, 2026

Description

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS simulator
  • Android simulator
  • iOS device
  • Android device

Testing instructions

Run the demo app in apps/speech and run transcription in both timestamping and regular mode.
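For reference, a minimal usage sketch of the two modes, based on the signature in the diff further down; the function name transcribe, the loadAudio helper, and the timestamps option name are assumptions, not confirmed by this PR:

// Sketch only: transcribe, loadAudio, and the timestamps option name are
// hypothetical; the return type Promise<string | Word[]> comes from the diff.
const waveform: Float32Array = await loadAudio('sample.wav');

// Regular mode: resolves to the plain transcription string.
const text = await transcribe(waveform);

// Timestamping mode: resolves to a list of Word objects instead.
const words = await transcribe(waveform, { timestamps: true });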

Screenshots

Related issues

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

msluszniak self-assigned this Jan 21, 2026
msluszniak added the feature label Jan 21, 2026
msluszniak marked this pull request as draft Jan 21, 2026 14:58
msluszniak linked an issue Jan 21, 2026 that may be closed by this pull request
msluszniak marked this pull request as ready for review Jan 21, 2026 18:43
  waveform: Float32Array | number[],
  options: DecodingOptions = {}
- ): Promise<string> {
+ ): Promise<string | Word[]> {
Contributor

How about returning a single type instead of a type union? I checked the OpenAI docs, and for word-level timestamping they do something like this:

{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "words": [
    {
      "word": "The",
      "start": 0.0,
      "end": 0.23999999463558197
    },
    ...
    {
      "word": "volleyball",
      "start": 7.400000095367432,
      "end": 7.900000095367432
    }
  ],
  "usage": {
    "type": "duration",
    "seconds": 9
  }
}

This is likely familiar to the user if they have ever used the OpenAI API, and the user doesn't have to merge the words themselves when using timestamps.
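A rough sketch of what that single structured type could look like on this library's side; the word, start, end, text, and words fields follow the OpenAI example above, but the interface names are assumptions, not part of this PR:

interface Word {
  word: string;
  start: number; // start time in seconds
  end: number;   // end time in seconds
}

// Hypothetical result shape; the name TranscriptionResult is an assumption.
interface TranscriptionResult {
  text: string;
  words?: Word[]; // present only when word-level timestamps are requested
}

With this shape, the signature in the diff above would return Promise<TranscriptionResult> instead of Promise<string | Word[]>.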

Member Author

OK, so we want to always return the plain transcription and additionally a list of Words if needed, right?

Member Author

And the second question: does OpenAI always return both the timestamps and the full transcription, or is it optional, as we have it right now?

Contributor

I think it's optional, so we only return the timestamps if needed. It makes sense to me to match the structure they're returning exactly.
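Under that agreement, the decoder could branch roughly as follows; this is a sketch assuming a timestamps flag on DecodingOptions, the TranscriptionResult shape sketched above, and a hypothetical internal decodeWords helper:

async function transcribe(
  waveform: Float32Array | number[],
  options: DecodingOptions = {}
): Promise<TranscriptionResult> {
  // decodeWords is a hypothetical internal decoder returning Word[].
  const words = await decodeWords(waveform);
  const text = words.map((w) => w.word).join(' ');
  // Attach word-level timestamps only when explicitly requested,
  // mirroring the optional words field in the OpenAI response.
  return options.timestamps ? { text, words } : { text };
}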


Labels

feature: PRs that implement a new feature

Development

Successfully merging this pull request may close these issues.

Add speech to text timestamping
