## Research paper - Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

### Key ideas and results

The paper introduces **Delayed Streams Modeling (DSM)** - a new approach for streaming sequence-to-sequence learning. The key hypothesis is that by aligning different modalities (like audio and text) to a shared framerate and introducing appropriate delays between them, you can achieve real-time streaming generation without needing complex alignment policies.

The main contributions are:
- A flexible framework that works for multiple tasks (ASR and TTS) with the same architecture
- State-of-the-art performance with low latency (as low as a few hundred milliseconds)
- Support for arbitrarily long sequences with batching capability

The core idea is surprisingly elegant. Instead of processing all input before generating output (offline) or learning complex policies about when to read/write (traditional streaming), DSM:

- **Aligns sequences**: Both audio and text are aligned to the same framerate (12.5 Hz)
- **Introduces delays**: The output stream is delayed by τ steps relative to the input, creating a "lookahead" window
- **Uses parallel streams**: A decoder-only transformer processes multiple token streams simultaneously

- For **ASR** (speech-to-text): audio is the input stream; text is delayed and predicted
- For **TTS** (text-to-speech): text is the input stream; audio is delayed and generated
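
As a rough illustration of the delay mechanism, the sketch below shifts the output stream by τ frames relative to the input. The function name `make_delayed_pair`, the string tokens, and the use of PAD to extend the input are assumptions made for illustration; they are not details taken from the paper.

```python
PAD = "PAD"


def make_delayed_pair(input_stream, output_stream, tau, pad=PAD):
    """Return (input, output) streams where the output lags the input by tau frames.

    Both returned streams have length len(input_stream) + tau, so the delayed
    output still has room for its final tokens.
    """
    if len(input_stream) != len(output_stream):
        raise ValueError("streams must be aligned to the same number of frames")
    if tau < 0:
        raise ValueError("tau must be non-negative")
    padded_input = list(input_stream) + [pad] * tau      # assumption: pad the tail
    delayed_output = [pad] * tau + list(output_stream)   # output starts tau frames late
    return padded_input, delayed_output


# ASR-style usage: audio is the input, text is the delayed output.
audio = ["a0", "a1", "a2"]
text = ["WORD", "hi", "PAD"]
padded_audio, delayed_text = make_delayed_pair(audio, text, tau=2)
```

With `tau=2`, the first text token lines up with the third audio frame, so the model has already consumed two extra frames of audio when it predicts it. For TTS the roles are simply swapped.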

**Performance Results**

**DSM-ASR (Automatic Speech Recognition):**
- Achieves **7.9% average WER** (Word Error Rate) on long-form datasets
- Competitive with or better than non-streaming models like Whisper-Large-V2 (9.0% WER)
- Outperforms other streaming models
- Provides word-level timestamps with 80ms precision

**DSM-TTS (Text-to-Speech):**
- Achieves **1.58-1.71% WER**, measured by transcribing the generated speech with an ASR system (lower is better)
- Speaker similarity score of **0.70-0.71** (comparable to baselines)
- Competitive with non-streaming models like F5-TTS and Cosyvoice
- Only model providing long-form synthesis in a streaming manner
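
Both result lists above report WER. As a reminder of what the metric measures, here is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. The function name is hypothetical, and real evaluations usually normalize text (casing, punctuation) first, which this sketch does not do.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Standard dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```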

**Key Advantages**

The paper emphasizes DSM provides:
- **Low latency**: As low as a few hundred milliseconds
- **Batching support**: Unlike most streaming models
- **Arbitrary length sequences**: No chunking needed
- **Single architecture**: Same model structure for both ASR and TTS

### How do you align audio and text on the same framerate

Aligning both modalities to a shared framerate is a crucial part of making DSM work.

For **audio**, they use a neural codec called **Mimi** that compresses the raw waveform:
- Takes audio at 24kHz sample rate
- Compresses it down to 12.5 Hz (so 12.5 "frames" per second)
- Each frame represents 80ms of audio
- Uses vector quantization to turn it into discrete tokens
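
The numbers above fit together with simple arithmetic, sketched below. The constant and function names are illustrative, and rounding a partial final frame up is an assumption rather than something stated in the paper.

```python
import math

SAMPLE_RATE_HZ = 24_000   # input audio sample rate
FRAME_RATE_HZ = 12.5      # codec frame rate

SAMPLES_PER_FRAME = SAMPLE_RATE_HZ / FRAME_RATE_HZ   # waveform samples covered by one frame
FRAME_DURATION_MS = 1000 / FRAME_RATE_HZ             # milliseconds covered by one frame


def num_frames(num_samples):
    """Number of codec frames needed to cover num_samples waveform samples."""
    return math.ceil(num_samples / SAMPLES_PER_FRAME)  # assumption: round partial frames up


frames_in_ten_seconds = num_frames(10 * SAMPLE_RATE_HZ)
```

Each frame covers 1920 samples, which is where the 80 ms per frame comes from, and ten seconds of audio becomes 125 frames.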

For **text**, they use **word-level timestamps**:
- Each word has a timestamp indicating when it's spoken
- They place the word tokens at the corresponding frame position
- For example, if a word starts at 0.5 seconds, it goes at frame position 6 (0.5 × 12.5 = 6.25, rounded down)
- They use special tokens: **WORD** (marks word start), **PAD** (empty frames between words)
- Frames without words get filled with PAD tokens

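So if someone says "Hello world" where "Hello" starts at 0s and "world" starts at 0.8s (frame 10), the aligned text stream, with one token per frame and character-level sub-tokens used purely for illustration, might look like:
```
Frames 0-5: WORD, H, e, l, l, o
Frames 6-9: PAD
Frames 10-15: WORD, w, o, r, l, d
Frames 16+: PAD
```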

The challenge they mention is that most speech datasets only have sentence-level timing, not word-level. 

### Text tokenization, and how text and audio tokens are aligned

**Text Tokenization**

They use a **custom vocabulary** specifically trained on speech transcription data (not a standard text tokenizer). The vocabulary has:
- Regular word tokens (vocabulary size 8000)
- Two special tokens: **PAD** and **WORD**

**The Alignment Process**

Here's how they align text to the 12.5 Hz audio framerate:

1. **Start with word-level timestamps**: Each word has a start time (e.g., "hello" starts at 0.24 seconds)

2. **Convert time to frame index**: Multiply the start time by the framerate
   - Example: 0.24s × 12.5 = frame 3

3. **Place tokens in the sequence**:
   - Put **WORD** token at the start frame
   - Follow immediately with the word's sub-tokens (like "h", "e", "l", "l", "o")
   - Fill any empty frames with **PAD**

When a word like "hello" is tokenized into sub-tokens [h, e, l, l, o], these tokens are placed **consecutively starting from the word's start frame**. For a word whose start frame is 0:

```
Frame 0: WORD
Frame 1: h
Frame 2: e
Frame 3: l
Frame 4: l
Frame 5: o
Frame 6: PAD (until next word)
```

So the word's tokens "flow forward" in time, occupying consecutive frames. The **WORD** token marks where a new word begins, then its sub-tokens follow in sequence.

This means:
- Short words might only take 2-3 frames (the WORD marker plus one or two sub-tokens)
- Longer words could span many frames
- The actual pronunciation duration doesn't matter for the text stream - it's just sequential token placement
- PAD fills gaps between words
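
The placement procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `align_text`, rounding the frame index down, and pushing a word to the next free frame when it would overlap the previous one are all assumptions.

```python
import math

FRAME_RATE_HZ = 12.5
PAD, WORD = "PAD", "WORD"


def align_text(words, num_frames):
    """Lay out words on a fixed frame grid.

    words: list of (start_time_seconds, sub_tokens) pairs, sorted by start time.
    Returns a list of length num_frames holding WORD, sub-tokens, or PAD.
    """
    stream = [PAD] * num_frames
    next_free = 0
    for start_s, sub_tokens in words:
        # Small epsilon guards against float error such as 0.24 * 12.5 -> 2.9999...
        start = math.floor(start_s * FRAME_RATE_HZ + 1e-9)
        start = max(start, next_free)  # assumption: never overwrite the previous word
        for offset, token in enumerate([WORD] + list(sub_tokens)):
            if start + offset >= num_frames:
                break  # tokens past the end of the grid are dropped in this sketch
            stream[start + offset] = token
        next_free = start + 1 + len(sub_tokens)
    return stream


stream = align_text([(0.0, list("hello")), (0.8, list("world"))], num_frames=18)
```

Here "hello" occupies frames 0 to 5, frames 6 to 9 stay PAD, and "world" starts with its WORD marker at frame 10.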

The audio stream, meanwhile, has tokens at **every** frame representing the actual sound at that moment.

During training, the model learns to predict text tokens that are **delayed by τ frames** relative to the audio. So if τ=16 frames (1.28 seconds), the model sees audio frames 0-16 before predicting text frame 0.

With the text stream laid out this way, both streams have exactly one position per 80 ms time step: Mimi codec tokens on the audio side, and a WORD, sub-token, or PAD entry on the text side.
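
Because every text position corresponds to a fixed 80 ms frame, word start times can in principle be read off from where the WORD tokens land. The sketch below shows one plausible way to do that for a text stream delayed by τ frames; the function name and the subtraction of τ are assumptions for illustration, not the paper's stated procedure.

```python
FRAME_MS = 80  # one frame at 12.5 Hz
WORD = "WORD"


def word_start_times_ms(delayed_text_stream, tau):
    """Map each WORD marker in a tau-delayed text stream back to a start time in ms."""
    return [
        (index - tau) * FRAME_MS  # undo the delay, then convert frames to milliseconds
        for index, token in enumerate(delayed_text_stream)
        if token == WORD
    ]


delayed = ["PAD", "PAD", "WORD", "h", "i", "PAD", "WORD", "y", "o"]
starts = word_start_times_ms(delayed, tau=2)
```

In this example the two words are reported as starting at 0 ms and 320 ms, with a resolution of one frame.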

### Training phases and datasets

For DSM-ASR:

**Pretraining:**
- 2.5 million hours of publicly available audio (English and French)
- Transcribed automatically using whisper-timestamped
- Trained on 90-second random segments

**Finetuning:**
- "A collection of public datasets with ground-truth transcripts" totaling 28k hours
- The paper mentions details are in "Appendix A.1"

**Long-form adaptation:**
- A special "long-form mixture" described in "Appendix A.2"

For DSM-TTS:

**Pretraining:**
- 150-second audio extracts from the same 2.5M hour collection

### Delay conditioning feature

The **delay conditioning** feature is a clever training trick that gives you flexibility at inference time.

**The Problem**

Normally, you'd train with a fixed delay (say τ=16 frames). But different use cases need different tradeoffs:
- **Low latency** (small delay): Faster response, but lower quality transcription
- **High quality** (large delay): Better transcription, but more lag

Without delay conditioning, you'd need to train a separate model for each delay value you want to support.

**The Solution**

Instead, they train **one model** on random delays:
- Each training example uses a different randomly sampled delay
- The model receives the delay value as an extra input (using a cosine embedding)
- The model learns: "given delay X, predict text accordingly"
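
A minimal sketch of this training trick follows. The delay range, the embedding dimension, the frequency schedule, and both function names are assumptions chosen for illustration; the notes above only say that a random delay is sampled per example and passed to the model through a cosine embedding.

```python
import math
import random


def sample_delay_frames(rng, min_frames=2, max_frames=32):
    """Draw a per-example delay in frames (range is an assumed placeholder)."""
    return rng.randint(min_frames, max_frames)


def cosine_embedding(delay_frames, dim=8, max_period=10_000.0):
    """Encode a scalar delay as cosines at geometrically spaced frequencies."""
    return [
        math.cos(delay_frames / max_period ** (i / dim))
        for i in range(dim)
    ]


rng = random.Random(0)
delay = sample_delay_frames(rng)      # a different delay for each training example
condition = cosine_embedding(delay)   # extra input vector given to the model
```

At inference time the same embedding function would be called with the delay the user asks for, instead of a sampled one.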

**At Inference**

You simply tell the model what delay you want (e.g., 400ms for low latency, or 2 seconds for high quality), and it adjusts its predictions to match that latency/quality tradeoff.

Think of it as one model that can operate at multiple "speeds" rather than a single fixed one, so the quality/latency tradeoff is chosen at inference time without retraining.

### Batching support

This is one of DSM's key practical advantages.

**The core insight:** DSM operates at a **constant framerate** (12.5 Hz). At each time step, the model processes exactly one frame for each stream, regardless of what's in it.

This means:
- Every example in a batch advances by exactly 1 frame per step
- All sequences stay synchronized
- You can run multiple audio streams through the model simultaneously

**Why other streaming models can't batch:**

Traditional streaming models use **policies** that decide "should I read more input or write output?" These decisions vary per example:
- Example 1 might need 3 input frames before writing
- Example 2 might write immediately
- They get out of sync, so you have to process them one at a time

**DSM's advantage:**

Since everything moves in lockstep (one frame per step for all streams), you can stack multiple examples and process them together efficiently on a GPU.
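
The lockstep behaviour can be sketched as a plain loop in which every step consumes exactly one frame from every example. The names `run_lockstep` and `step_fn` are hypothetical, and `step_fn` stands in for a single batched forward pass of the model.

```python
def run_lockstep(batch, step_fn):
    """Advance all examples together, one frame per step.

    batch: list of frame sequences, all of the same length.
    step_fn: callable taking one frame per example and returning one output per example.
    """
    num_steps = len(batch[0])
    if any(len(seq) != num_steps for seq in batch):
        raise ValueError("all sequences must have the same number of frames")
    outputs = [[] for _ in batch]
    for t in range(num_steps):
        frames = [seq[t] for seq in batch]   # exactly one frame from each example
        results = step_fn(frames)            # one batched call per time step
        for out, result in zip(outputs, results):
            out.append(result)
    return outputs


# Toy usage: the "model" just upper-cases each frame.
outs = run_lockstep([["a", "b"], ["c", "d"], ["e", "f"]],
                    lambda frames: [f.upper() for f in frames])
```

No per-example read/write decision appears anywhere in the loop, which is what keeps the batch synchronized.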

The paper notes this is "a feature rarely provided by streaming models" and helps with throughput.

### Speaker voices

This is specific to the TTS model and how it controls whose voice to generate.

**Speaker Encoding Process**

The model can handle **up to 5 different speakers** in a conversation. For each speaker:

1. **Extract a 10-second audio sample** of that speaker (from outside the training segment)
2. **Pass it through a speaker encoder** that produces a fixed-dimension embedding
3. The speaker encoder uses the same architecture as the Mimi codec encoder
4. Convolutional layers are frozen, but Transformer layers are fine-tuned

**Conditioning the Model**

The speaker embeddings are fed to the model through **cross-attention layers**:
- Concatenate embeddings from all speakers (up to 5)
- Add positional embeddings to distinguish which speaker is which
- Feed through cross-attention to the main backbone

If there are fewer than 5 speakers, they pad with learned embeddings. If more than 5, they randomly select 5.
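
The padding and selection rules can be sketched as follows, using plain Python lists in place of tensors. The function name, the argument names, and the idea of adding the positional embedding element-wise are assumptions for illustration; `pad_emb` stands in for the learned padding embedding.

```python
import random

MAX_SPEAKERS = 5


def prepare_speaker_slots(speaker_embs, pad_emb, pos_embs, rng):
    """Build the fixed-size list of speaker vectors fed to cross-attention.

    speaker_embs: one embedding (list of numbers) per speaker.
    pad_emb: embedding used for empty slots.
    pos_embs: MAX_SPEAKERS positional embeddings, one per slot.
    """
    if len(speaker_embs) > MAX_SPEAKERS:
        speaker_embs = rng.sample(speaker_embs, MAX_SPEAKERS)  # random subset of 5
    slots = list(speaker_embs) + [pad_emb] * (MAX_SPEAKERS - len(speaker_embs))
    # Add the slot's positional embedding so the model can tell speakers apart.
    return [
        [value + position for value, position in zip(emb, pos)]
        for emb, pos in zip(slots, pos_embs)
    ]


rng = random.Random(0)
pos = [[i, 0] for i in range(MAX_SPEAKERS)]
slots = prepare_speaker_slots([[1, 1], [2, 2]], pad_emb=[9, 9], pos_embs=pos, rng=rng)
```

With two speakers, the first two slots carry real embeddings and the remaining three carry the padding embedding, each offset by its slot position.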

**Controlling Turns in Dialogue**

They use special tokens to control who's speaking:
- **MAIN**: Marks when the primary speaker starts talking
- **OTHER**: Marks when another speaker takes over

At inference, you provide speaker embeddings for each person, then insert MAIN/OTHER tokens to orchestrate the conversation.
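
One way to picture the turn markers is a script-to-token conversion like the sketch below. This is only an illustration of the idea: the function name, the `(speaker, words)` turn format, and placing a marker at the start of every turn are assumptions, not the paper's exact scheme.

```python
MAIN, OTHER = "MAIN", "OTHER"


def build_dialogue_tokens(turns, main_speaker):
    """Flatten a list of (speaker, words) turns into one token list with turn markers."""
    tokens = []
    for speaker, words in turns:
        # Mark who starts talking before emitting that turn's words.
        tokens.append(MAIN if speaker == main_speaker else OTHER)
        tokens.extend(words)
    return tokens


script = [("alice", ["hi", "there"]), ("bob", ["hello"]), ("alice", ["bye"])]
tokens = build_dialogue_tokens(script, main_speaker="alice")
```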

### Limitations

The paper identifies a few key limitations:

**1. Need for Aligned Domains**

The biggest limitation they mention is that **DSM requires aligned domains** - meaning you need data where audio and text have word-level timestamps. This "reduces the amount of gold-standard ground-truth data that can be used for training."

Most speech datasets only have sentence-level alignment, so they had to:
- Use pseudo-labels from Whisper for pretraining
- Apply Dynamic Time Warping to derive word alignments for finetuning

**2. Independence Assumption**

They note that "perfect independence is hard to achieve" - meaning the output at time t isn't truly independent of future input beyond the delay window. For example, in ASR, a named entity might be ambiguous without seeing more context.

**3. Safety Concerns (TTS)**

For their TTS model, they kept the speaker encoder closed-source due to impersonation risks. They acknowledge that voice cloning "opens up both opportunities in inclusive human-machine interactions and risks of fraudulent impersonation."

The paper mentions they'll extend DSM to more tasks in future work, suggesting the current scope (ASR and TTS) is somewhat limited.