Merged
66 changes: 65 additions & 1 deletion openapi.yaml
@@ -6748,12 +6748,60 @@ paths:
"audio": "<base64_encoded_audio_chunk>"
}
```
- `input_audio_buffer.commit`: Signal end of audio stream
- `input_audio_buffer.commit`: Signal end of audio stream. When VAD is enabled, the server automatically detects speech boundaries and emits `completed` events. When VAD is disabled, you must send `commit` to trigger transcription of the buffered audio.
```json
{
"type": "input_audio_buffer.commit"
}
```
- `transcription_session.updated`: Update session configuration, including Voice Activity Detection (VAD) parameters. Send this after receiving `session.created`, or at any later point in the session to change VAD settings.
```json
{
"type": "transcription_session.updated",
"session": {
"turn_detection": {
"type": "server_vad",
"threshold": 0.3,
"min_silence_duration_ms": 500,
"min_speech_duration_ms": 250,
"max_speech_duration_s": 5.0,
"speech_pad_ms": 250
}
}
}
```
To disable VAD entirely (manual commit mode), set `turn_detection` to `null`:
```json
{
"type": "transcription_session.updated",
"session": {
"turn_detection": null
}
}
```
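
The two messages above differ only in the value of `turn_detection`, so a client can build both from one helper. The following is a minimal Python sketch, not part of the API: the function names `server_vad` and `build_session_update` are hypothetical, and the field names are taken from the JSON examples above.

```python
import json
from typing import Optional

# Field names copied from the `turn_detection` example above.
VAD_FIELDS = {
    "threshold",
    "min_silence_duration_ms",
    "min_speech_duration_ms",
    "max_speech_duration_s",
    "speech_pad_ms",
}


def server_vad(**overrides) -> dict:
    """Build a `turn_detection` object; omitted fields are left to server defaults."""
    unknown = set(overrides) - VAD_FIELDS
    if unknown:
        raise ValueError(f"unknown VAD fields: {sorted(unknown)}")
    return {"type": "server_vad", **overrides}


def build_session_update(turn_detection: Optional[dict]) -> str:
    """Serialize a `transcription_session.updated` client message.

    Passing None serializes `turn_detection` as JSON null (manual commit mode).
    """
    message = {
        "type": "transcription_session.updated",
        "session": {"turn_detection": turn_detection},
    }
    return json.dumps(message)
```

For example, `build_session_update(server_vad(threshold=0.2))` produces a message that changes only the threshold, and `build_session_update(None)` produces the disable message.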

**Voice Activity Detection (VAD)**

VAD controls how the server automatically detects speech segments in the audio stream. When enabled (the default), the server uses Silero VAD to identify speech regions and emits transcription events as each segment completes. When disabled, you must manually call `input_audio_buffer.commit` to trigger transcription.
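
To make the segmentation behaviour concrete, here is a simplified Python illustration of how per-frame speech probabilities could be turned into segments using `threshold`, `min_silence_duration_ms`, and `min_speech_duration_ms`. This is an assumption-laden sketch, not Silero VAD's actual algorithm or the server's implementation: the function `segment_frames`, the fixed frame length, and the omission of `max_speech_duration_s` and `speech_pad_ms` are all simplifications made for illustration.

```python
def segment_frames(probs, frame_ms, threshold=0.3,
                   min_silence_duration_ms=500, min_speech_duration_ms=250):
    """Group frames whose probability exceeds `threshold` into (start_ms, end_ms) segments.

    A segment is closed once the silence after its last speech frame reaches
    `min_silence_duration_ms`; segments shorter than `min_speech_duration_ms`
    are discarded.
    """
    segments = []
    start = None            # start time of the currently open segment
    last_speech_end = None  # end time of the last speech frame in that segment

    for i, p in enumerate(probs):
        frame_start = i * frame_ms
        frame_end = frame_start + frame_ms
        if p > threshold:
            if start is None:
                start = frame_start
            last_speech_end = frame_end
        elif start is not None and frame_end - last_speech_end >= min_silence_duration_ms:
            if last_speech_end - start >= min_speech_duration_ms:
                segments.append((start, last_speech_end))
            start = None

    # Flush a segment that is still open when the audio ends.
    if start is not None and last_speech_end - start >= min_speech_duration_ms:
        segments.append((start, last_speech_end))
    return segments
```

In this sketch a 200 ms pause between two bursts is shorter than the 500 ms default, so the bursts merge into one segment, while an isolated 100 ms burst is dropped as too short.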

VAD can be configured in two ways:
1. **Query parameters** at connection time: `turn_detection=server_vad&threshold=0.3&min_silence_duration_ms=500`
2. **Session message** after connection: Send `transcription_session.updated` with a `turn_detection` object (see above)

To disable VAD at connection time, use `turn_detection=none` as a query parameter.
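
The query-parameter form can be assembled with the standard library. This is a hedged sketch: the helper `build_ws_url` and the base URL in the usage note are hypothetical, and it assumes the remaining VAD parameters use the same names in the query string as in the session message, which the text above only confirms for `threshold` and `min_silence_duration_ms`.

```python
from typing import Optional
from urllib.parse import urlencode


def build_ws_url(base_url: str, vad_params: Optional[dict]) -> str:
    """Append VAD query parameters to a WebSocket URL.

    Pass a dict of overrides to enable server VAD, or None to disable it
    with `turn_detection=none`.
    """
    if vad_params is None:
        query = {"turn_detection": "none"}
    else:
        query = {"turn_detection": "server_vad", **vad_params}
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(query)
```

With a placeholder endpoint, `build_ws_url("wss://example.invalid/realtime", {"threshold": 0.3, "min_silence_duration_ms": 500})` yields the query string shown in option 1.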

**VAD Parameters:**

All parameters are optional; omitted fields use their defaults.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `type` | string | `server_vad` | VAD mode. Use `server_vad` to enable, or set `turn_detection` to `null` to disable. |
| `threshold` | float | `0.3` | Speech probability threshold (0.0–1.0). Audio frames with probability above this value are classified as speech. Lower values detect more speech but may increase false positives. For low-SNR audio (e.g., 8 kHz phone calls), values of 0.01–0.2 may work better. |
| `min_silence_duration_ms` | int | `500` | Minimum silence duration in milliseconds before ending a speech segment. Higher values merge nearby speech bursts into single segments. For phone calls with mid-sentence pauses, values of 2000–5000 ms prevent over-segmentation. |
| `min_speech_duration_ms` | int | `250` | Minimum speech segment duration in milliseconds. Segments shorter than this are discarded. Filters out brief noise bursts or clicks. |
| `max_speech_duration_s` | float | `5.0` | Maximum speech segment duration in seconds. Segments longer than this are force-split at the longest internal silence gap. Useful for continuous speech without natural pauses. |
| `speech_pad_ms` | int | `250` | Padding in milliseconds added to the start and end of each detected segment. Prevents clipping speech edges. When padding would cause adjacent segments to overlap, the gap is split at the midpoint instead. |
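
The midpoint rule for `speech_pad_ms` is easiest to see with numbers. The Python sketch below is one reading of the description in the table, not the server's code: the function `pad_segments` is hypothetical, segments are assumed to be sorted, non-overlapping `(start_ms, end_ms)` pairs, and clamping at the end of the audio is left out.

```python
def pad_segments(segments, pad_ms):
    """Pad each (start_ms, end_ms) segment by `pad_ms` on both sides.

    If the gap to a neighbouring segment is too small to fit both paddings,
    the shared boundary is placed at the midpoint of the gap instead.
    """
    padded = []
    for i, (start, end) in enumerate(segments):
        new_start = max(0, start - pad_ms)  # never pad before the start of the audio
        new_end = end + pad_ms

        if i > 0:
            prev_end = segments[i - 1][1]
            if start - prev_end < 2 * pad_ms:
                new_start = (prev_end + start) / 2

        if i + 1 < len(segments):
            next_start = segments[i + 1][0]
            if next_start - end < 2 * pad_ms:
                new_end = (end + next_start) / 2

        padded.append((new_start, new_end))
    return padded
```

With the default 250 ms padding, segments `(1000, 2000)` and `(2300, 3000)` are only 300 ms apart, so both are cut at 2150 ms rather than overlapping.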

**Server Events:**
- `session.created`: Initial session confirmation (sent first)
@@ -6768,6 +6816,22 @@ paths:
}
}
```
- `transcription_session.updated`: Confirms session configuration was applied. Sent in response to a client `transcription_session.updated` message.
```json
{
"type": "transcription_session.updated",
"session": {
"turn_detection": {
"type": "server_vad",
"threshold": 0.3,
"min_silence_duration_ms": 500,
"min_speech_duration_ms": 250,
"max_speech_duration_s": 5.0,
"speech_pad_ms": 250
}
}
}
```
- `conversation.item.input_audio_transcription.delta`: Partial transcription results
```json
{