
Add WebSocket protocol guide for Realtime API #203

Draft
jaderf-sm wants to merge 3 commits into speechmatics:main from jaderf-sm:issue/202-websocket-docs

Conversation


@jaderf-sm jaderf-sm commented Feb 18, 2026

Add WebSocket protocol guide for Realtime API

Why

TL;DR: see #202

The Realtime quickstart only covers Python and JavaScript SDKs. Developers working in Go, Rust, Java, or any other language have no tutorial-style guide to follow — only the API reference spec, which documents schemas but doesn't walk through the actual message flow.

What this PR adds

A new guide at Speech to text > Realtime > Guides > WebSocket protocol that walks a developer through a complete Realtime transcription session using raw WebSocket messages:

  1. How to connect and authenticate (both server-side and browser-side)
  2. How to start a session with StartRecognition (with minimal and full examples)
  3. How to stream audio as binary frames and handle server acknowledgements
  4. How to receive final and partial transcripts
  5. How to cleanly end the session (a minimal end-to-end sketch of the full flow follows this list)
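
To make the flow concrete, here is a minimal end-to-end sketch in Go (one of the priority languages listed under "Future improvements"), using github.com/gorilla/websocket. The endpoint, the /v2/ path, and the StartRecognition / AudioAdded / EndOfStream messages are taken from this PR; the audio_format values, the RecognitionStarted / AddPartialTranscript / AddTranscript / EndOfTranscript names, and the metadata.transcript field reflect my reading of the API reference and should be verified against the published schemas. example.raw is a placeholder filename.

```go
// Minimal Realtime transcription session over a raw WebSocket.
// Sketch only: a production client needs error handling, backpressure,
// and reconnection logic (see the guide sections on those topics).
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"sync/atomic"

	"github.com/gorilla/websocket"
)

// serverMsg holds just the fields this sketch reads; real messages carry more.
type serverMsg struct {
	Message  string `json:"message"`
	SeqNo    int64  `json:"seq_no"`
	Metadata struct {
		Transcript string `json:"transcript"`
	} `json:"metadata"`
}

func main() {
	// 1. Connect and authenticate (server-side: API key in the Authorization header).
	header := http.Header{"Authorization": {"Bearer " + os.Getenv("SPEECHMATICS_API_KEY")}}
	conn, _, err := websocket.DefaultDialer.Dial("wss://eu.rt.speechmatics.com/v2/", header)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	var lastSeq atomic.Int64
	started := make(chan struct{}) // closed on RecognitionStarted
	done := make(chan struct{})    // closed on EndOfTranscript

	// 4. Read server messages: acknowledgements, partial and final transcripts.
	go func() {
		defer close(done)
		for {
			_, data, err := conn.ReadMessage()
			if err != nil {
				return
			}
			var m serverMsg
			if json.Unmarshal(data, &m) != nil {
				continue
			}
			switch m.Message {
			case "RecognitionStarted":
				close(started)
			case "AudioAdded":
				lastSeq.Store(m.SeqNo) // ack for one binary audio frame
			case "AddPartialTranscript":
				fmt.Println("partial:", m.Metadata.Transcript)
			case "AddTranscript":
				fmt.Println("final:  ", m.Metadata.Transcript)
			case "EndOfTranscript":
				return
			}
		}
	}()

	// 2. Start the session. Language goes in the body, not the URL path.
	start, _ := json.Marshal(map[string]any{
		"message": "StartRecognition",
		"audio_format": map[string]any{
			"type": "raw", "encoding": "pcm_s16le", "sample_rate": 16000,
		},
		"transcription_config": map[string]any{"language": "en"},
	})
	if err := conn.WriteMessage(websocket.TextMessage, start); err != nil {
		log.Fatal(err)
	}
	<-started

	// 3. Stream raw PCM audio as binary frames.
	f, err := os.Open("example.raw")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	buf := make([]byte, 8192)
	for {
		n, rerr := f.Read(buf)
		if n > 0 {
			if err := conn.WriteMessage(websocket.BinaryMessage, buf[:n]); err != nil {
				log.Fatal(err)
			}
		}
		if rerr == io.EOF {
			break
		}
		if rerr != nil {
			log.Fatal(rerr)
		}
	}

	// 5. End the session with the last acknowledged seq_no, then wait for
	//    EndOfTranscript. Note: the final AudioAdded ack may still be in
	//    flight here; a robust client waits for acks to catch up first.
	eos, _ := json.Marshal(map[string]any{
		"message":     "EndOfStream",
		"last_seq_no": lastSeq.Load(),
	})
	conn.WriteMessage(websocket.TextMessage, eos)
	<-done
}
```

Run it with SPEECHMATICS_API_KEY set and a raw PCM file in the working directory; the ffmpeg command discussed in question 4 below produces one.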

The guide also covers practical concerns that developers building their own clients need to know about:

  • Backpressure — what happens if you send audio faster than the server can process, and how to avoid connection drops (based on patterns from the Python SDK; see the pacing sketch after this list)
  • Session limits — the 48-hour, 1-hour idle, and 3-minute ping timeouts, including the warning messages the server sends before auto-terminating
  • Error handling — common WebSocket close codes and when to retry
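
On the backpressure bullet, one simple mitigation is to send audio at real-time rate rather than as fast as the source can be read, so the client never runs far ahead of the server. The sketch below is illustrative, not necessarily the Python SDK's exact mechanism; it assumes 16-bit mono PCM and drops into step 3 of the sketch above (imports: io, time, and gorilla/websocket).

```go
// paceAudio sends raw PCM in 100 ms chunks, one per 100 ms tick, so audio is
// streamed at roughly real-time rate instead of as fast as the file reads.
// Assumes 16-bit mono PCM, so 100 ms = sampleRate * 2 bytes / 10.
func paceAudio(conn *websocket.Conn, r io.Reader, sampleRate int) error {
	chunk := make([]byte, sampleRate*2/10)
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		n, err := io.ReadFull(r, chunk)
		if n > 0 {
			if werr := conn.WriteMessage(websocket.BinaryMessage, chunk[:n]); werr != nil {
				return werr
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // end of audio
		}
		if err != nil {
			return err
		}
	}
	return nil
}
```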

Small cross-link additions were made to three existing pages so developers can discover the new guide:

  • Quickstart — info box at the top for developers looking for the raw protocol
  • Input page — new bullet in the "Next steps" section
  • API reference — tip after the protocol overview diagram

Open Technical Questions

  1. Which endpoint URL should we document? The existing docs use eu.rt.speechmatics.com but both SDKs default to eu2.rt.speechmatics.com. The guide currently uses the docs version.
  2. Is language needed in the URL path? The Python SDK connects to /v2/en but the JS SDK and API reference use just /v2/. The guide uses /v2/ and puts language in the StartRecognition body only.
  3. What does last_seq_no do on the server side? The guide documents the mechanics (set it to the last seq_no you received from AudioAdded) but a brief explanation of why would make the guide more helpful.
  4. Sample audio for testing — the guide links to the existing example.wav from the JS SDK repo and provides an ffmpeg command for raw PCM conversion. Would a hosted .raw file be better?
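
On question 4, the conversion command is presumably along these lines; the 16 kHz mono, 16-bit little-endian parameters are assumptions here and must match the audio_format declared in StartRecognition:

```sh
ffmpeg -i example.wav -f s16le -acodec pcm_s16le -ac 1 -ar 16000 example.raw
```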

How to review

  1. Start the dev server (volta run npm start) and navigate to Speech to text > Realtime > Guides > WebSocket protocol
  2. Walk through the guide as if you were a developer in an unsupported language — does the message flow make sense? Are the JSON examples clear?
  3. Are the technical details in the documentation correct? (these were inferred from available SDK source code)
  4. Check the cross-links from the quickstart, input page, and API reference

Test plan

  • [volta run] npm run build passes with no broken links
  • Page renders correctly and sidebar shows "WebSocket protocol" under Guides
  • Mermaid diagram renders in both light and dark themes
  • Cross-links from quickstart, input, and API reference all work
  • SME review of technical details documented

Note: SME questions for review are marked as JSX comments ({/* SME-REVIEW */}) in the websocket-protocol.mdx source where these open questions apply. Search for SME-REVIEW to find them.

URLs to test

| Page | What to check |
| --- | --- |
| /speech-to-text/realtime/guides/websocket-protocol | New page renders, Mermaid diagram displays, all links resolve |
| /speech-to-text/realtime/quickstart | New :::info admonition links to WebSocket protocol guide |
| /speech-to-text/realtime/input | New bullet in "Next steps" links to WebSocket protocol guide |
| /api-ref/realtime-transcription-websocket | New :::tip after Mermaid diagram links to WebSocket protocol guide |

Future improvements

The guide currently describes the raw WebSocket message flow without language-specific code samples. A natural next step is to add short, self-contained code snippets that demonstrate connecting to the API and running a basic transcription session.

Priority languages (no official SDK available):

  • Go — widely used for backend services and CLI tools
  • Java/Kotlin — dominant in enterprise and Android development
  • Swift — needed for native iOS/macOS clients
  • Ruby — popular in web development (Rails ecosystem), but probably lower priority as it is quickly losing market share

Each sample would cover the full lifecycle: connect, authenticate, send StartRecognition, stream audio, receive transcripts, and send EndOfStream. The goal is a copy-paste-and-run experience, similar to how the existing quickstart works for Python and JavaScript.

Closes #202

@jaderf-sm jaderf-sm self-assigned this Feb 18, 2026
@jaderf-sm jaderf-sm added documentation Improvements or additions to documentation enhancement New feature or request labels Feb 18, 2026

vercel bot commented Feb 18, 2026

@jaderfeijo is attempting to deploy a commit to the Speechmatics Team on Vercel.

A member of the Team first needs to authorize it.


vercel bot commented Feb 18, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Ready | Preview, Comment | Feb 18, 2026 2:09pm |

A US endpoint is also available at `wss://us.rt.speechmatics.com/v2`. See the [full endpoint list](/get-started/authentication#supported-endpoints) for all regions.

{/* <!-- SME-REVIEW: Python SDK appends language to path (/v2/en). JS SDK and API reference use /v2/ only. Confirming /v2/ is correct for raw WebSocket clients. --> */}
Collaborator

Language is not needed in the URL path. That is the legacy URL pattern, which may still work fine, but we should definitely encourage the new scheme.


### 5. End the session

When you've finished sending audio, send an `EndOfStream` message. Set `last_seq_no` to the `seq_no` from the last `AudioAdded` message you received:
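
For example (the last_seq_no value is illustrative; the message and field names are from the guide text above):

```json
{
  "message": "EndOfStream",
  "last_seq_no": 123
}
```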
Collaborator

To your question about the server-side purpose of last_seq_no: I believe it's so that we only return transcription results for the audio frames before that seq_no, to ensure a clean cutoff. I need to confirm that, though.
