Audio Interaction Model

Today's Large Audio Language Models (LALMs) are stuck in an offline paradigm: you hand them a complete audio clip, wait, and get a reply. Streaming audio models exist, but each one only handles a single, isolated task. There has never been a general streaming audio language model. We formalize that missing capability as a new concept, the Audio Interaction Model, and build the first one. AudioInteraction is a unified Audio Interaction Model that:

✅ Runs conventional offline audio tasks (ASR, S2TT, AQA...)

✅ Runs streaming audio tasks in real time (Voice chatting...)

✅ Achieves general streaming audio instruction following on a live stream

✅ Does all of the above inside a single, all-in-one model that is always-on and proactive

Technical Report 📖 / StreamAudio-2M 🤗 / AudioInteraction Model 🤗 / Streaming-Audio-Bench 🏆

▶ Click to watch AudioInteraction listen, decide, and speak — live (YouTube)

🔥 News

[Coming]: We will release the full dataset and data curation pipeline.
[Coming]: We will release the full training configs and pipeline.
May 20, 2026: 🔥 We release StreamAudio-2M.
May 20, 2026: 🔥 We release the AudioInteraction Inference and Training Codebase.
May 19, 2026: 🔥 AudioInteraction model weights are now available on Hugging Face.
May 19, 2026: 🔥 We release the AudioInteraction Technical Report.

⚡ Quick Start

AudioInteraction is an always-on model: it keeps listening to incoming audio frames and decides for itself when to speak. By default it stays in a ⟨Silent⟩ state and only emits output when the task or the acoustic context warrants it — so you can open a single session, stream audio into it continuously, and watch every capability take turns on its own.

Installation

git clone https://github.com/AudioInteraction/AudioInteraction.git
cd AudioInteraction

conda create -n AudioInteraction python=3.12 -y
conda activate AudioInteraction
# Make sure your installed PyTorch build supports CUDA
pip install -r requirements.txt
# install ffmpeg
conda install -c conda-forge ffmpeg

Download Weights

# download model weights from huggingface
export PYTHONPATH=./
python download.py

Inference and WebUI

Run inference first, then start the WebUI demo.

# Add project root to PYTHONPATH
export PYTHONPATH=./

# 1. Offline inference
python infer_offline.py

# To test bundled samples, set input_path in infer_offline.py to one of:
# sample/01_count_bark/sequence.json
# sample/02_translate/sequence.json
# sample/03_cough_music/sequence.json

# 2. Real-time inference
python infer_online.py

WebUI real-time demo

# Download model weights from Hugging Face first
python web/server.py

# Open in browser:
# http://localhost:5001

🎬 Demos

Most audio models do one job and wait to be asked. AudioInteraction's defining trait is that all of its abilities live in the same continuous stream, and the model itself decides which one is needed at each moment. The demo below is one unbroken session, one model, no mode switches, no prompts — transcription, understanding, conversation, and proactive intervention simply happen as the soundscape changes.

Capability 1 — Online audio understanding

Input (streaming)	gpt-audio	doubao-voicechat	gemini-omni	AudioInteraction (Ours)
Continuous ambient audio: footsteps, a door opening, distant traffic.	❌ Record-then-infer: waits for the clip to end, then returns one summary — no incremental narration.	⚠️ Speech-centric: lumps non-speech into "background noise" and misses individual events.	⚠️ Buffers a fixed window first, so narration lags several seconds behind the sound.	✅ Detects each event incrementally and narrates the scene in real time, without waiting for the clip to end.

Capability 2 — Real-time transcription & translation

Input (streaming)	gpt-audio	doubao-voicechat	gemini-omni	AudioInteraction (Ours)
A speaker talking continuously while the model listens.	⚠️ Clean transcript, but only after the utterance finishes — no mid-sentence partials.	⚠️ Streams ASR well, but translation is turn-based and only fires at sentence boundaries.	⚠️ Emits chunks but re-decodes aggressively, causing flicker and unstable partials.	✅ Emits partial transcripts and translations chunk by chunk with low latency, correcting incrementally as context arrives.

Capability 3 — Voice chat beyond speech

Input (streaming)	gpt-audio	doubao-voicechat	gemini-omni	AudioInteraction (Ours)
A user asks about a song playing in the background while talking.	⚠️ Hears the speech but ignores the music — answers as if no song were playing.	❌ Treats the music as noise to suppress; can't reason about it.	⚠️ Can ID the song in isolation, but can't fuse it with the ongoing conversation.	✅ Jointly perceives speech, music, and general audio, and responds in a context-aware, full-spectrum conversation.

Capability 4 — Proactive intervention

Input (streaming)	gpt-audio	doubao-voicechat	gemini-omni	AudioInteraction (Ours)
A smoke alarm starts beeping while the user is silent.	❌ Stays silent — only responds when prompted; no self-initiated speech.	❌ Waits for a wake word / user turn; never volunteers a warning.	❌ No notion of when to speak; requires an explicit query.	✅ Holds `⟨Silent⟩` until the acoustic cue appears, then switches to `⟨Speak⟩` and warns the user — no prompt required.

⚙️ SoundFlow: Train your own Audio Interaction Model

Offline audio models answer a finished clip, but real audio needs a model that listens continuously and decides, moment to moment, whether to speak. SoundFlow trains a single model that at every chunk chooses between ⟨Speak⟩ and ⟨Silent⟩, so recognition, translation, and dialogue become instructions inside one always-on perceive–decide–respond loop — a Large Audio Interaction Model (LAIM) — instead of separate per-task models. The framework covers the whole pipeline: stitching short clips into long interactions for data, chunk-level decision training with history review and comprehension-aware silence, and asynchronous FIFO inference that cuts first-frame latency by 4.5×.
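
The perceive-decide-respond loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `run_stream`, `toy_decide`, and the `SILENT` constant are hypothetical names that do not come from the AudioInteraction codebase, the decision rule is a toy stand-in for the model, and the FIFO queue merely shows how audio capture can be decoupled from inference. It is not SoundFlow's actual asynchronous inference implementation.

```python
import queue
import threading
from typing import Callable, Iterable, List, Optional

SILENT = "⟨Silent⟩"  # hypothetical stand-in for the model's silent decision


def run_stream(
    chunks: Iterable[bytes],
    decide: Callable[[List[bytes]], Optional[str]],
) -> List[str]:
    """Push chunks through a FIFO queue and record one decision per chunk."""
    fifo = queue.Queue()
    outputs: List[str] = []

    def producer() -> None:
        # Audio capture side: enqueue chunks as they arrive.
        for chunk in chunks:
            fifo.put(chunk)
        fifo.put(None)  # sentinel marking the end of the stream

    def consumer() -> None:
        # Inference side: decide at every chunk, given all audio heard so far.
        history: List[bytes] = []
        while True:
            chunk = fifo.get()
            if chunk is None:
                break
            history.append(chunk)
            reply = decide(history)
            outputs.append(reply if reply is not None else SILENT)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs


def toy_decide(history: List[bytes]) -> Optional[str]:
    """Toy rule (assumption): speak only when the newest chunk is 'loud'."""
    latest = history[-1]
    return "I heard something." if max(latest, default=0) > 200 else None


if __name__ == "__main__":
    stream = [bytes([10, 20]), bytes([250, 3]), bytes([0, 0])]
    print(run_stream(stream, toy_decide))
```

In this sketch the model stays silent by default and only the second chunk triggers speech, which mirrors the always-on behavior described above at a very small scale.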

🔧 Finetuning (data samples are in /src/audiointeraction/dataset/examples)

You can fine-tune AudioInteraction on your own streaming data, and you can also use this repository to train standard offline audio language models. There are two steps: build the training data, then train.

1. Prepare training data

Edit the path constants at the top of each script first:

File	Constants to fill in
`src/audiointeraction/dataset/get_feat.py`	`QWEN_OMNI_CKPT`, `AUDIO_TOWER_CKPT`
`src/audiointeraction/dataset/get_dataset_online.py`	`QWEN_OMNI_CKPT`
`src/audiointeraction/dataset/get_dataset_offline.py`	`QWEN_OMNI_CKPT`, `AUDIO_TOWER_CKPT`

Input JSONL format

Online (streaming, multi-turn audio). One JSON object per line:

{"conversation": [
    {"audio_path": "/path/to/turn1.wav", "assistant": "reply 1", "emotion": "normal"},
    {"audio_path": "/path/to/turn2.wav", "assistant": "reply 2", "emotion": "happy"}
]}

audio_path and assistant are required on every turn.
emotion is optional and defaults to "normal". Allowed values: happy, sad, angry, surprise, normal, urgent.
To make the model stay silent on a turn, set assistant to "<no need to response>".

A single-turn shorthand is also accepted:

{"merge_path": "/path/to/audio.wav", "assistant": "reply", "emotion": "normal"}
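
Before running the data scripts, it can help to check each line against the rules above. The following is a minimal sketch of such a check, not part of the repository: `normalize_online_record` is a hypothetical helper, and treating `merge_path` as the audio path of a single turn is an assumption based on the shorthand example.

```python
import json
from typing import Any, Dict, List

ALLOWED_EMOTIONS = {"happy", "sad", "angry", "surprise", "normal", "urgent"}
SILENT_REPLY = "<no need to response>"  # literal marker from the format description


def normalize_online_record(line: str) -> List[Dict[str, Any]]:
    """Parse one online JSONL line and return its turns with defaults applied."""
    record = json.loads(line)
    if "conversation" in record:
        turns = record["conversation"]
    elif "merge_path" in record:
        # Assumption: the shorthand is equivalent to a one-turn conversation.
        turns = [{
            "audio_path": record["merge_path"],
            "assistant": record.get("assistant"),
            "emotion": record.get("emotion", "normal"),
        }]
    else:
        raise ValueError("expected a 'conversation' or 'merge_path' field")

    normalized = []
    for i, turn in enumerate(turns):
        # audio_path and assistant are required on every turn.
        for key in ("audio_path", "assistant"):
            if not turn.get(key):
                raise ValueError(f"turn {i}: missing required field '{key}'")
        # emotion is optional and defaults to "normal".
        emotion = turn.get("emotion", "normal")
        if emotion not in ALLOWED_EMOTIONS:
            raise ValueError(f"turn {i}: unsupported emotion '{emotion}'")
        normalized.append({
            "audio_path": turn["audio_path"],
            "assistant": turn["assistant"],
            "emotion": emotion,
            "silent": turn["assistant"] == SILENT_REPLY,
        })
    return normalized


if __name__ == "__main__":
    example = '{"merge_path": "/path/to/audio.wav", "assistant": "reply"}'
    print(normalize_online_record(example))
```

The extra `silent` flag is only a convenience for inspecting data; the training scripts may represent silent turns differently.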

Offline (single-turn). One JSON object per line, either the flat form:

{"user": "user text", "assistant": "reply", "audio_path": "/path/to/audio.wav"}

or the online-style multi-turn shape, in which case only the first turn is used:

{"conversation": [{"user": "...", "assistant": "...", "audio_path": "..."}, ...]}

assistant is always required. The task variant is decided by which other fields are present:

Has `audio_path`?	Has `user`?	Task
✓	✓	`A_T_T` — audio + user text → assistant
✓		`A_T` — audio → assistant
	✓	`T_T` — user text → assistant
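
The table above amounts to a small decision rule. A sketch of it follows, assuming that an empty string counts as an absent field; `offline_task` is a hypothetical helper and not a function from the repository.

```python
from typing import Any, Dict


def offline_task(record: Dict[str, Any]) -> str:
    """Return the task variant (A_T_T, A_T, or T_T) for one offline record."""
    # In the multi-turn shape, only the first turn is used.
    turn = record["conversation"][0] if "conversation" in record else record
    if not turn.get("assistant"):
        raise ValueError("'assistant' is always required")

    has_audio = bool(turn.get("audio_path"))  # assumption: "" means absent
    has_user = bool(turn.get("user"))
    if has_audio and has_user:
        return "A_T_T"  # audio + user text -> assistant
    if has_audio:
        return "A_T"    # audio -> assistant
    if has_user:
        return "T_T"    # user text -> assistant
    raise ValueError("need at least one of 'audio_path' or 'user'")


if __name__ == "__main__":
    print(offline_task({"user": "user text", "assistant": "reply",
                        "audio_path": "/path/to/audio.wav"}))
```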

Data process

# Online: <input.jsonl> <output.jsonl> <error.log> <feature_dir>
python src/audiointeraction/dataset/get_dataset_online.py \
    <input.jsonl> <output.jsonl> <error.log> <feature_dir>
# Example:
# python src/audiointeraction/dataset/get_dataset_online.py \
#     data/online_raw.jsonl data/online.jsonl logs/online.err features/online

# Offline: <input.jsonl> <output.jsonl> <error.log> <feature_dir>
python src/audiointeraction/dataset/get_dataset_offline.py \
    <input.jsonl> <output.jsonl> <error.log> <feature_dir>
# Example:
# python src/audiointeraction/dataset/get_dataset_offline.py \
#     data/offline_raw.jsonl data/offline.jsonl logs/offline.err features/offline

Both scripts are resumable: re-running picks up where the previous run stopped, skipping any idx that was already written. For a parallel multi-GPU template, see src/audiointeraction/dataset/process_get_feature.sh.
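
For readers wondering how this kind of resumability typically works, here is a minimal sketch. It is not the repository's implementation: it assumes that `idx` is the zero-based input line number and that each output record stores it under an `"idx"` key, and `load_done_idx` and `process_resumable` are hypothetical names.

```python
import json
import os
from typing import Any, Callable, Dict, Set


def load_done_idx(output_path: str) -> Set[int]:
    """Collect the idx values already present in the output file."""
    done: Set[int] = set()
    if not os.path.exists(output_path):
        return done
    with open(output_path, encoding="utf-8") as f:
        for line in f:
            try:
                done.add(json.loads(line)["idx"])
            except (json.JSONDecodeError, KeyError):
                # Skip blank or truncated lines, e.g. from an interrupted write.
                continue
    return done


def process_resumable(
    input_path: str,
    output_path: str,
    process: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> int:
    """Process only the input lines whose idx is not yet in the output."""
    done = load_done_idx(output_path)
    written = 0
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "a", encoding="utf-8") as dst:
        for idx, line in enumerate(src):
            if idx in done or not line.strip():
                continue
            result = process(json.loads(line))
            result["idx"] = idx
            dst.write(json.dumps(result, ensure_ascii=False) + "\n")
            dst.flush()  # keep progress on disk in case the run is interrupted
            written += 1
    return written
```

Opening the output in append mode means a second run adds only the missing records and leaves earlier ones untouched.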

2. Train

# 1. Set the two data roots referenced by config.yaml
export DATA_ROOT=/path/to/your/jsonl/data
export CHECKPOINT_ROOT=/path/to/your/checkpoints
# Example:
# export DATA_ROOT=/data/audiointeraction/jsonl
# export CHECKPOINT_ROOT=/data/audiointeraction/ckpts

# 2. Edit hyperparameters / data sources in src/audiointeraction/finetune/config.yaml

# 3. Launch
python src/audiointeraction/finetune/full.py --config src/audiointeraction/finetune/config.yaml
# Example:
# python src/audiointeraction/finetune/full.py --config src/audiointeraction/finetune/config.yaml
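
A quick pre-flight check can catch an unset variable before a long training job starts. The sketch below is a hypothetical helper, not part of the repository; it assumes both roots should already exist as directories, which may not hold for `CHECKPOINT_ROOT` if the training script creates it.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("DATA_ROOT", "CHECKPOINT_ROOT")


def check_training_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            # Assumption: the directory is expected to exist before launch.
            problems.append(f"{name}={value} is not an existing directory")
    return problems


if __name__ == "__main__":
    for problem in check_training_env():
        print("warning:", problem)
```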

🎊 StreamAudio-2M: a large-scale streaming audio instruction-following corpus

StreamAudio-2M is a ~2.6M-item streaming instruction-following corpus (7.4M rounds, 66.7K hours) covering seven capabilities, including audio understanding, real-time ASR, speech translation, voice chatting, proactive response, and environment-aware agent. It is built in three steps: collecting clips from real-world datasets (AudioSet, CommonVoice, CoVoST2, MOSS, …), synthesizing text into speech with CosyVoice, and concatenating the results into streaming sequences with environmental noise and token-level annotation.

Sample structure

Each line is one streaming sequence made of multiple turns:

{
  "id": "voice_chatting_000123",
  "stream_scene_type": "Home Smart",
  "num_turns": 2,
  "turns": [
    {
      "user": "Turn the living room lights down a bit.",
      "assistant": "Sure, dimming them to 40%.",
      "emotion": "normal",
      "scene_type": "Home Smart",
      "audio_path": "voice_chatting/000123/turn_0.wav"
    },
    {
      "user": "Thanks. What's the temperature in here?",
      "assistant": "It's 22.5 degrees in the living room.",
      "emotion": "normal",
      "scene_type": "Home Smart",
      "audio_path": "voice_chatting/000123/turn_1.wav"
    }
  ]
}

Set assistant to "<no need to response>" for a turn where the model should stay silent.
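
The sample structure is close to the online finetuning input format, so a StreamAudio-2M record can plausibly be mapped onto it. The sketch below shows one way to do that. Whether the corpus is meant to be fed through the finetuning scripts this way is an assumption, and `to_online_record` and its `audio_root` parameter are hypothetical.

```python
import os
from typing import Any, Dict


def to_online_record(sample: Dict[str, Any], audio_root: str = "") -> Dict[str, Any]:
    """Map one StreamAudio-2M sample to a {"conversation": [...]} record."""
    turns = sample["turns"]
    if sample.get("num_turns", len(turns)) != len(turns):
        raise ValueError(f"{sample.get('id')}: num_turns does not match turns")

    conversation = []
    for turn in turns:
        conversation.append({
            # audio_path in the sample is relative; prepend a local root if given.
            "audio_path": os.path.join(audio_root, turn["audio_path"]),
            "assistant": turn["assistant"],
            "emotion": turn.get("emotion", "normal"),
        })
    return {"conversation": conversation}


if __name__ == "__main__":
    sample = {
        "id": "voice_chatting_000123",
        "num_turns": 1,
        "turns": [{
            "user": "Turn the living room lights down a bit.",
            "assistant": "Sure, dimming them to 40%.",
            "emotion": "normal",
            "audio_path": "voice_chatting/000123/turn_0.wav",
        }],
    }
    print(to_online_record(sample))
```

Fields such as `user`, `scene_type`, and `stream_scene_type` are dropped here because the online input format described earlier does not list them.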

📊 Experimental results of Audio-Interaction

Table 1: Results on MMAU Benchmark

Model	Size	Stream.	Multi-turn	Text Sound	Text Music	Text Speech	Text Avg.	Audio Sound	Audio Music	Audio Speech	Audio Avg.
*Large Audio Language Models*
Audio Flamingo 2	3B	✗	✗	71.47	70.96	44.74	62.40	1.50	1.49	0.35	1.16
Qwen2-Audio-Instruct	8.4B	✗	✓	54.95	50.98	42.04	49.20	22.32	19.16	16.31	19.41
Voxtral-Mini	3B	✗	✓	58.56	49.70	43.53	50.60	46.08	34.13	30.50	37.24
Audio-Reasoner	8.4B	✗	✗	60.06	64.30	60.70	61.71	20.48	26.65	13.48	20.57
*Omni Language Models*
Qwen2.5-Omni	3B	✗	✓	65.36	48.94	57.78	57.81	51.81	44.01	29.79	42.51
Qwen2.5-Omni	7B	✗	✓	67.87	69.16	59.76	65.60	60.54	50.90	35.11	49.58
Phi-4-multimodal	7B	✗	✓	60.97	52.87	52.83	55.56	44.65	27.84	21.99	31.75
Baichuan-Omni-1.5	11B	✗	✓	65.47	58.98	55.26	59.90	57.53	36.53	24.82	40.40
*Streaming Audio Language Models*
Audio-Interaction	3B	✓	✓	64.12	47.80	55.13	55.68	65.63	57.93	39.68	58.15

Table 2: Performance on Spoken-Dialogue Benchmarks

Model	Size	SpokenQA LLa. Q.	SpokenQA Web Q.	Voicebench Alpa.	Voicebench SD-QA
*Specialized Models*
Moshi	7B	62.20	26.30	2.01	15.01
Freeze-Omni	7B	72.00	44.73	4.14	50.16
*Omni & Audio Language Models*
Baichuan-Omni-1.5	7B	78.50	59.10	4.50	43.40
Qwen2-Audio	7B	69.67	45.20	3.74	35.71
Qwen2.5-Omni	3B	66.00	27.95	4.32	49.37
Qwen2.5-Omni	7B	75.33	62.80	4.49	55.71
Phi-4-multimodal	7B	60.2	26.6	3.81	39.78
*Streaming Audio Language Models*
Audio-Interaction	3B	67.31	54.34	4.28	52.14

Table 3: ASR WER and S2TT BLEU on LibriSpeech and CoVoST2

Model	Size	ASR Clean ↓	ASR Other ↓	S2TT en-zh ↑	S2TT zh-en ↑
*Specialized Models*
Canary	1B	1.48	2.93	-	-
Canary-Qwen	2.5B	1.49	3.10	-	-
*Omni & Audio Language Models*
Baichuan-Omni-1.5	7B	5.71	10.09	-	-
Qwen2-Audio	7B	1.60	3.60	45.20	24.40
Qwen2.5-Omni	3B	2.87	5.90	39.50	18.17
Qwen2.5-Omni	7B	1.80	3.40	41.40	29.40
Phi-4-multimodal	5.6B	1.69	3.82	46.30	22.39
*Streaming Audio Language Models*
Audio-Interaction	3B	3.17	6.04	55.22	35.21
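
For reference, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below computes it as a percentage. It is a generic illustration, not the evaluation script behind the table: it applies no text normalization (casing, punctuation), and the exact scoring setup used for the reported numbers is not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```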

Acknowledgements

We sincerely thank the creators, maintainers, and contributors of the public datasets and resources used in this work. We also thank the broader large audio language model community for laying the groundwork that made streaming audio modeling possible.

In particular, this project builds on the following open-source repositories:

Qwen2.5-Omni — the audio encoder and language model backbone behind AudioInteraction.
LitGPT — the training framework our finetuning code is built on.
CosyVoice — the text-to-speech model used to synthesize speech during data construction.

License, Citation & Stars

This project will be released under the Apache-2.0 License, which permits use, modification, and redistribution of AudioInteraction under its terms 🎉

Citation: You can cite AudioInteraction using the following BibTeX entry. Thank you for your kindness 🙂

@misc{xie2026audiointeractionmodel,
      title={Audio Interaction Model}, 
      author={Zhifei Xie and Zihang Liu and Ze An and Xiaobin Hu and Yue Liao and Ziyang Ma and Dongchao Yang and Mingbao Lin and Deheng Ye and Shuicheng Yan and Chunyan Miao},
      year={2026},
      eprint={2606.05121},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2606.05121}, 
}
