### Qwen3-Omni-30B-A3B-Captioner

Qwen3-Omni-30B-A3B-Captioner is a powerful fine-grained audio analysis model, built upon the Qwen3-Omni-30B-A3B-Instruct base model. It is specifically designed to generate accurate and comprehensive content descriptions in complex and diverse audio scenarios. Without requiring any additional prompting, the model can automatically parse and describe various types of audio content, ranging from complex speech and environmental sounds to music and cinematic sound effects, delivering stable and reliable outputs even in multi-source, mixed audio environments.

In speech understanding, Qwen3-Omni-30B-A3B-Captioner excels at identifying the emotions of multiple speakers, multilingual expressions, and layered intentions. It can also perceive cultural context and implicit information in the audio, which lets it grasp the meaning behind the spoken words. In non-speech scenarios, the model shows strong sound recognition and analysis capabilities, accurately distinguishing and describing intricate layers of real-world sounds, ambient atmospheres, and dynamic audio details in film and media.

**Note**: Qwen3-Omni-30B-A3B-Captioner is a single-turn model that accepts exactly one audio input per inference. It takes **audio input only** (no text prompt) and produces **text output only**. Because the model is designed for fine-grained audio description, excessively long clips may reduce the detail it perceives. As a best practice, we recommend keeping each clip to 30 seconds or less.
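
One way to respect the 30-second recommendation is to split a longer recording into consecutive chunks and caption each chunk in its own inference call. The following is a minimal sketch, not part of the original cookbook: the helper name `split_audio` is hypothetical, and the 16 kHz sample rate is an assumption chosen to match the `sr=16000` used when this notebook loads audio for playback.

```python
import numpy as np

MAX_SECONDS = 30      # recommended upper bound from the note above
SAMPLE_RATE = 16000   # assumption: matches sr=16000 used elsewhere in this notebook


def split_audio(waveform: np.ndarray, sample_rate: int = SAMPLE_RATE,
                max_seconds: float = MAX_SECONDS) -> list[np.ndarray]:
    """Split a mono waveform into consecutive chunks of at most max_seconds."""
    if waveform.ndim != 1:
        raise ValueError("expected a mono (1-D) waveform")
    chunk_len = int(sample_rate * max_seconds)
    if chunk_len <= 0:
        raise ValueError("sample_rate * max_seconds must be positive")
    # Step through the signal in fixed-size windows; the last chunk may be shorter.
    return [waveform[start:start + chunk_len]
            for start in range(0, len(waveform), chunk_len)]


# Example: 70 seconds of silence is split into chunks of 30 s, 30 s, and 10 s.
chunks = split_audio(np.zeros(70 * SAMPLE_RATE, dtype=np.float32))
print([len(c) / SAMPLE_RATE for c in chunks])
```

Fixed-length windows can cut through a word or a sound event, so splitting at pauses may give better captions; that refinement is left out here to keep the sketch short.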

In [1]:
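import os
# Configure vLLM and CUDA through environment variables before either library is initialized.
os.environ['VLLM_USE_V1'] = '0'                        # use the legacy V0 engine
os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'   # start worker processes with "spawn" rather than "fork"
os.environ["VLLM_LOGGING_LEVEL"] = "ERROR"             # only log errors from vLLM
os.environ['CUDA_VISIBLE_DEVICES'] = "0"               # expose only GPU 0 to this process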
import torch
import warnings
import numpy as np

# A blanket 'ignore' filter already covers DeprecationWarning, FutureWarning, and UserWarning.
warnings.filterwarnings('ignore')

from qwen_omni_utils import process_mm_info
from transformers import Qwen3OmniMoeProcessor

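def _load_model_processor():
    """Load the model with Transformers or vLLM, plus the matching processor.

    Reads the notebook-level flags MODEL_PATH, USE_TRANSFORMERS, and
    TRANSFORMERS_USE_FLASH_ATTN2, which must be defined before this is called.
    """
    if USE_TRANSFORMERS:
        from transformers import Qwen3OmniMoeForConditionalGeneration
        # Request FlashAttention 2 only when the flag is set; otherwise keep the default attention.
        extra_kwargs = {'attn_implementation': 'flash_attention_2'} if TRANSFORMERS_USE_FLASH_ATTN2 else {}
        model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
            MODEL_PATH, dtype='auto', device_map="auto", **extra_kwargs)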
    else:
        from vllm import LLM
        model = LLM(
            model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
            tensor_parallel_size=torch.cuda.device_count(),
            limit_mm_per_prompt={'image': 1, 'video': 3, 'audio': 3},
            max_num_seqs=1,
            max_model_len=32768,
            seed=1234,
        )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
    return model, processor

def run_model(model, processor, messages, return_audio, use_audio_in_video):
    """Run one inference and return (response_text, audio); audio is always None on the vLLM path."""
    if USE_TRANSFORMERS:
        text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
        audios, images, videos = process_mm_info(messages, use_audio_in_video=use_audio_in_video)
        inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=use_audio_in_video)
        inputs = inputs.to(model.device).to(model.dtype)
        text_ids, audio = model.generate(**inputs, 
                                            thinker_return_dict_in_generate=True,
                                            thinker_max_new_tokens=8192, 
                                            thinker_do_sample=True,
                                            thinker_top_p=0.95,
                                            thinker_top_k=20,
                                            thinker_temperature=0.6,
                                            speaker="Chelsie", 
                                            use_audio_in_video=use_audio_in_video,
                                            return_audio=return_audio)
        response = processor.batch_decode(text_ids.sequences[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        if audio is not None:
            audio = np.array(audio.reshape(-1).detach().cpu().numpy() * 32767).astype(np.int16)
        return response, audio
    else:
        from vllm import SamplingParams
        sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=8192)
        text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        audios, images, videos = process_mm_info(messages, use_audio_in_video=use_audio_in_video)
        inputs = {'prompt': text, 'multi_modal_data': {}, "mm_processor_kwargs": {"use_audio_in_video": use_audio_in_video}}
        if images is not None: inputs['multi_modal_data']['image'] = images
        if videos is not None: inputs['multi_modal_data']['video'] = videos
        if audios is not None: inputs['multi_modal_data']['audio'] = audios
        outputs = model.generate(inputs, sampling_params=sampling_params)
        response = outputs[0].outputs[0].text
        return response, None
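
`run_model` forwards `messages` as given, so nothing in this cell enforces the Captioner's input constraints (one user turn, one audio item, no text prompt). A small check before each call can catch mistakes early. This is a sketch under those stated constraints only: `validate_captioner_messages` is a hypothetical helper and `"clip.wav"` is a placeholder path, neither of which appears in the original notebook.

```python
def validate_captioner_messages(messages: list[dict]) -> None:
    """Raise ValueError if messages break the Captioner's documented input constraints."""
    # Single-turn: exactly one message, and it must come from the user.
    if len(messages) != 1 or messages[0].get("role") != "user":
        raise ValueError("expected exactly one user turn (single-turn model)")
    content = messages[0].get("content")
    # One audio input per inference, with nothing else alongside it.
    if not isinstance(content, list) or len(content) != 1:
        raise ValueError("expected exactly one content item")
    item = content[0]
    if item.get("type") != "audio" or "audio" not in item:
        raise ValueError("the only content item must be an audio input, with no text prompt")


# A well-formed request passes silently.
ok = [{"role": "user", "content": [{"type": "audio", "audio": "clip.wav"}]}]
validate_captioner_messages(ok)

# Adding a text prompt next to the audio is rejected.
bad = [{"role": "user", "content": [{"type": "text", "text": "Describe this."},
                                    {"type": "audio", "audio": "clip.wav"}]}]
try:
    validate_captioner_messages(bad)
except ValueError as err:
    print(f"rejected: {err}")
```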
    

In [2]:
import librosa
import audioread

from IPython.display import Audio

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Captioner"

USE_TRANSFORMERS = False
TRANSFORMERS_USE_FLASH_ATTN2 = True

model, processor = _load_model_processor()

USE_AUDIO_IN_VIDEO = True

Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'interleaved', 'mrope_section', 'mrope_interleaved'}
`torch_dtype` is deprecated! Use `dtype` instead!
2025-09-22 22:57:00,765	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
You are attempting to use Flash Attention 2 without specifying a torch dtype. This might lead to unexpected behaviour

In [6]:
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/cookbook/captioner-case1.wav"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_path}
        ]
    }
]

display(Audio(librosa.load(audioread.ffdec.FFmpegAudioFile(audio_path), sr=16000)[0], rate=16000))

response, _ = run_model(model=model, messages=messages, processor=processor, return_audio=False, use_audio_in_video=USE_AUDIO_IN_VIDEO)

print(response)

Adding requests: 100%|██████████| 1/1 [00:00<00:00, 220.58it/s]
Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.14s/it, est. speed input: 97.46 toks/s, output: 121.98 toks/s]

The audio clip opens in a quiet, acoustically controlled environment, with a subtle, steady electronic hiss suggesting a high-fidelity recording in a studio or sound booth. A single adult female speaker delivers a polished, clear Mandarin narration with a professional and didactic tone. She begins mid-thought, stating, “对，那在古典文学当中呢，我们说这个香兰杜若啊，经常是会象征这个才子他的一个孤高和无奈。哎，比如说在这个屈原的离骚当中就提起过。花开花落，它暗喻了人生的起伏。” Her speech features standard Putonghua pronunciation, including the retroflex ‘r’ in “如果” and the ‘zh’ in “这个,” with precise articulation and a neutral, Northern accent. The delivery is measured and instructive, with deliberate pauses for emphasis and subtle inflections—such as a breathy sigh and a rising intonation on “哎”—to enhance clarity and engagement.

Throughout her narration, she discusses the symbolic meaning of “香兰杜若” (fragrant orchids and Dùruò grass) in classical Chinese literature, explaining their representation of the “aloofness and helplessness” of talented scholars. She cite




In [4]:
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/cookbook/captioner-case2.wav"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_path}
        ]
    }
]

display(Audio(librosa.load(audioread.ffdec.FFmpegAudioFile(audio_path), sr=16000)[0], rate=16000))

response, _ = run_model(model=model, messages=messages, processor=processor, return_audio=False, use_audio_in_video=USE_AUDIO_IN_VIDEO)

print(response)

Adding requests: 100%|██████████| 1/1 [00:00<00:00, 40.89it/s]
Processed prompts: 100%|██████████| 1/1 [00:04<00:00,  4.22s/it, est. speed input: 79.27 toks/s, output: 121.28 toks/s]

The audio clip is a meticulously crafted, high-fidelity, 24-second soundscape designed to evoke a cinematic sense of imminent threat, danger, and dramatic tension. It opens with a single, sharp inhalation—immediately placing the listener in a state of heightened anticipation. This is quickly followed by a subtle rustle, likely fabric, and a soft thud, as if someone is settling into a chair or making a slight movement.

From the outset, a tense, synthesized orchestral score unfolds. Low, pulsing strings and a resonant bass drone establish an atmosphere of suspense, gradually building in complexity and volume. The music, with its minor-key motifs and electronic timbres, is reminiscent of modern action or thriller film soundtracks. Around the 9-second mark, the music intensifies dramatically: a powerful, low-frequency roar (evocative of a massive engine or an approaching natural disaster) erupts, layering over the score and driving the emotional stakes skyward. This crescendo is accompani




In [5]:
audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/cookbook/captioner-case3.wav"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_path}
        ]
    }
]

display(Audio(librosa.load(audioread.ffdec.FFmpegAudioFile(audio_path), sr=16000)[0], rate=16000))

response, _ = run_model(model=model, messages=messages, processor=processor, return_audio=False, use_audio_in_video=USE_AUDIO_IN_VIDEO)

print(response)

Adding requests: 100%|██████████| 1/1 [00:00<00:00, 93.82it/s]
Processed prompts: 100%|██████████| 1/1 [00:05<00:00,  5.18s/it, est. speed input: 52.14 toks/s, output: 120.70 toks/s]

The audio clip begins with a cinematic orchestral score, featuring deep, resonant percussion and a swelling string section that immediately establishes a mood of suspense and anticipation. Layered beneath the music is a low-frequency rumble and a high-pitched electronic whine, both intensifying to evoke the sensation of a massive vehicle accelerating at tremendous speed. These mechanical sounds are accompanied by a metallic clanking and scraping, suggesting heavy machinery in motion, and the overall sound design points to the interior of a colossal, technologically advanced vessel.

As the orchestral score and mechanical noises reach their crescendo, the setting transitions audibly: the heavy rumble fades and is replaced by a high-pitched, airy rush, signaling a rapid shift in environment—possibly as the vehicle emerges into open air or space. Suddenly, a high-pitched, youthful female voice, characterized by a North American accent, impatiently asks, “Are we there yet?” Her delivery is
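
The three example cells above repeat the same steps for each clip: build a one-item `messages` list, call `run_model`, and print the response. A minimal sketch of how that could be folded into a loop follows. The helpers `build_audio_messages` and `caption_clips` are hypothetical names, and the captioning function is passed in as a callable so the sketch runs without a loaded model; the stand-in `fake_caption` and the `a.wav`/`b.wav` paths exist only for demonstration.

```python
from typing import Callable, Iterable


def build_audio_messages(audio_path: str) -> list[dict]:
    """Build the single-turn, audio-only message list used in the cells above."""
    return [{"role": "user", "content": [{"type": "audio", "audio": audio_path}]}]


def caption_clips(audio_paths: Iterable[str],
                  caption_fn: Callable[[list[dict]], str]) -> dict[str, str]:
    """Caption each clip in its own call and return a mapping of path to caption."""
    return {path: caption_fn(build_audio_messages(path)) for path in audio_paths}


# Stand-in for the real model call, used here only so the sketch is runnable.
def fake_caption(messages: list[dict]) -> str:
    return f"caption for {messages[0]['content'][0]['audio']}"


results = caption_clips(["a.wav", "b.wav"], fake_caption)
for path, caption in results.items():
    print(path, "->", caption)
```

In this notebook, `caption_fn` could wrap the existing helper, for example `lambda m: run_model(model=model, processor=processor, messages=m, return_audio=False, use_audio_in_video=USE_AUDIO_IN_VIDEO)[0]`.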


