### Screen Recording Interaction with Qwen2.5-Omni

This notebook demonstrates how to use Qwen2.5-Omni to retrieve information from a screen recording by asking questions about what is shown on screen.

In [None]:
!pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
!pip install qwen-omni-utils
!pip install openai
!pip install flash-attn --no-build-isolation

In [None]:
from qwen_omni_utils import process_mm_info

# @title inference function
def inference(video_path, prompt, sys_prompt):
    messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "video", "video": video_path},
            ]
        },
    ]
    # Render the chat template to a prompt string; tokenization happens in the processor call below.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Load the video frames; the audio track of the video is ignored here.
    audios, images, videos = process_mm_info(messages, use_audio_in_video=False)
    inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=False)
    inputs = inputs.to(model.device).to(model.dtype)

    # Generate text only; speech output is disabled with return_audio=False.
    output = model.generate(**inputs, use_audio_in_video=False, return_audio=False)

    # The decoded text contains the system and user turns as well as the assistant's answer.
    text = processor.batch_decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    return text



Load the model and processor.

In [3]:
import torch
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor

model_path = "Qwen/Qwen2.5-Omni-7B"
model = Qwen2_5OmniModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
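
The `flash-attn` package does not build on every machine. The sketch below chooses a value for `attn_implementation` depending on whether `flash_attn` can be imported. The helper name `pick_attn_implementation` is hypothetical, and the `"sdpa"` fallback is an assumption: it is a standard `transformers` option, but whether this model supports it at the pinned commit has not been verified here.

```python
import importlib.util


def pick_attn_implementation(fallback: str = "sdpa") -> str:
    """Return "flash_attention_2" if flash-attn is installed, else the fallback.

    Assumption: the fallback value is accepted by the model being loaded.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return fallback


print(pick_attn_implementation())
```

The returned string could then be passed as `attn_implementation=pick_attn_implementation()` in the `from_pretrained` call above.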


In [4]:
from IPython.display import Video, display

#### Understanding

In [None]:
video_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/screen.mp4"
prompt = "Which browser is used in this video?"

display(Video(video_path, width=640, height=360))

# Run inference with the local Hugging Face model.
response = inference(video_path, prompt=prompt, sys_prompt="You are a helpful assistant.")
print(response[0])

qwen-vl-utils using torchvision to read video.


system
You are a helpful assistant.
user
Which browser is used in this video?
assistant
The browser used in the video is Google Chrome.
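
The decoded string printed above contains the whole conversation (system, user, and assistant turns). A minimal sketch for keeping only the assistant's answer follows. It assumes the role names appear on their own lines, as in the output shown, and that no earlier turn contains a line consisting only of `assistant`. The helper name `extract_assistant_reply` is hypothetical.

```python
def extract_assistant_reply(decoded: str) -> str:
    """Return the text after the first line that consists only of "assistant".

    Falls back to the whole (stripped) string if no such line is found.
    """
    marker = "\nassistant\n"
    idx = decoded.find(marker)
    if idx == -1:
        return decoded.strip()
    return decoded[idx + len(marker):].strip()


sample = "system\nYou are a helpful assistant.\nuser\nA question\nassistant\nAn answer."
print(extract_assistant_reply(sample))
```

In the notebook this would be called as `extract_assistant_reply(response[0])`.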


#### OCR

In [None]:
video_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/screen.mp4"
prompt = "Who are the authors of this paper?"

display(Video(video_path, width=640, height=360))

# Run inference with the local Hugging Face model.
response = inference(video_path, prompt=prompt, sys_prompt="You are a helpful assistant.")
print(response[0])



system
You are a helpful assistant.
user
Who are the authors of this paper?
assistant
The authors of the paper "Attention Is All You Need" are:

1. Ashish Vaswani
2. Noam Shazeer
3. Niki Parmar
4. Jakob Uszkoreit
5. Llion Jones
6. Aidan N. Gomez
7. Lukasz Kaiser
8. Illia Polosukhin


#### Summarize

In [None]:
video_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/screen.mp4"
prompt = "Summarize this paper in short."

display(Video(video_path, width=640, height=360))

# Run inference with the local Hugging Face model.
response = inference(video_path, prompt=prompt, sys_prompt="You are a helpful assistant.")
print(response[0])



system
You are a helpful assistant.
user
Summarize this paper in short.
assistant
The paper "Attention Is All You Need" introduces the Transformer model, a novel architecture for sequence-to-sequence tasks that relies entirely on self-attention mechanisms. The Transformer outperforms existing models in machine translation tasks, achieving state-of-the-art BLEU scores on WMT 2014 English-German and English-French translation tasks. The model is highly parallelizable, allowing for efficient training on large datasets and GPUs. The paper also discusses the application of the Transformer to other tasks, such as English constituency parsing, and highlights its potential for handling large inputs and outputs, such as images and audio.
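
The cells in this notebook ask several questions about the same video. A minimal sketch of doing this in one loop follows. The helper name `ask_all` is hypothetical, and the inference function is passed in as an argument so the helper does not depend on the global `model` and `processor`. Each call still processes the video from scratch, so this saves typing rather than compute.

```python
from typing import Callable, Dict, List


def ask_all(
    video_path: str,
    prompts: List[str],
    infer: Callable[..., List[str]],
    sys_prompt: str = "You are a helpful assistant.",
) -> Dict[str, str]:
    """Ask each prompt about the same video and map prompt -> decoded response."""
    answers: Dict[str, str] = {}
    for prompt in prompts:
        # `infer` is expected to have the same signature as `inference` above
        # and to return a list with one decoded string per batch item.
        response = infer(video_path, prompt=prompt, sys_prompt=sys_prompt)
        answers[prompt] = response[0]
    return answers
```

In the notebook this could be used as `answers = ask_all(video_path, ["Summarize this paper in short."], inference)`.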


#### Assistant

In [None]:
video_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/screen.mp4"
prompt = "Please translate the abstract of the paper into Chinese."

display(Video(video_path, width=640, height=360))

# Run inference with the local Hugging Face model.
response = inference(video_path, prompt=prompt, sys_prompt="You are a helpful assistant.")
print(response[0])



system
You are a helpful assistant.
user
Please translate the abstract of the paper into Chinese.
assistant
The abstract of the paper "Attention Is All You Need" by Vaswani et al. (2017) is as follows:

---

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based entirely on self-attention. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8