## Local Inference on GPU
Model page: https://huggingface.co/SimulaMet/SoccerChat-qwen2-vl-7b


In [None]:
!pip install ms-swift bitsandbytes decord qwen_vl_utils==0.0.11

In [1]:
import os
import torch
from swift.llm import PtEngine, RequestConfig, InferRequest
from transformers import BitsAndBytesConfig

In [2]:
# 4-bit quantization so the model fits on the free Colab T4;
# the paper reports performance for the unquantized model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",         # best accuracy for 4-bit
    bnb_4bit_use_double_quant=True,    # better compression
    bnb_4bit_compute_dtype=torch.float16
)

# Video sampling settings the model was trained with:
# exactly 24 frames per clip, at most 100352 pixels per frame.
os.environ["FPS_MIN_FRAMES"] = "24"
os.environ["FPS_MAX_FRAMES"] = "24"
os.environ["VIDEO_MAX_PIXELS"] = "100352"
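
The frame and pixel settings above bound how many visual tokens one clip can contribute to the prompt. The sketch below works through that arithmetic. It assumes that one visual token covers an `image_factor` x `image_factor` pixel area and that `frame_factor` consecutive frames are merged into one temporal step; both factors are the values reported in the load log further down (28 and 2), and the interpretation is an assumption about Qwen2-VL preprocessing, not something this notebook states.

```python
# Values taken from the cell above and from the swift load log.
VIDEO_MAX_PIXELS = 100352  # per-frame pixel cap (VIDEO_MAX_PIXELS)
NUM_FRAMES = 24            # FPS_MIN_FRAMES == FPS_MAX_FRAMES == 24
IMAGE_FACTOR = 28          # "Setting image_factor: 28" in the log
FRAME_FACTOR = 2           # "Setting frame_factor: 2" in the log

# Assumption: one visual token covers an IMAGE_FACTOR x IMAGE_FACTOR pixel area.
pixels_per_token = IMAGE_FACTOR * IMAGE_FACTOR
tokens_per_frame = VIDEO_MAX_PIXELS // pixels_per_token

# Assumption: FRAME_FACTOR consecutive frames share one temporal step.
temporal_steps = NUM_FRAMES // FRAME_FACTOR

max_video_tokens = tokens_per_frame * temporal_steps

print(f"tokens per frame (upper bound): {tokens_per_frame}")
print(f"temporal steps:                 {temporal_steps}")
print(f"video tokens (upper bound):     {max_video_tokens}")
```

Under these assumptions the cap works out to 128 tokens per frame and 1536 tokens per clip, well below the `max_length` of 32768 that the engine reports.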

In [3]:

engine = PtEngine(
    model_id_or_path="Qwen/Qwen2-VL-7B-Instruct",   # base model
    adapters=["SimulaMet/SoccerChat-qwen2-vl-7b"],  # SoccerChat adapter loaded on top
    quantization_config=bnb_config,
    attn_impl="sdpa",
    max_batch_size=1,
    use_hf=True,  # download from the Hugging Face Hub
)

[INFO:swift] Downloading the model from HuggingFace Hub, model_id: Qwen/Qwen2-VL-7B-Instruct
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[INFO:swift] Loading the model using model_dir: /root/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/eed13092ef92e448dd6875b2a00151bd3f7db0ac
`torch_dtype` is deprecated! Use `dtype` instead!
[INFO:swift] Setting torch_dtype: torch.bfloat16
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
[INFO:swift] attn_impl: sdpa
[INFO:swift] model_kwargs: {'device_map': 'cuda:0', 'quantization_config': BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "float16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu

[INFO:swift] Setting image_factor: 28. You can adjust this hyperparameter through the environment variable: `IMAGE_FACTOR`.
[INFO:swift] Setting min_pixels: 3136. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting max_pixels: 12845056. You can adjust this hyperparameter through the environment variable: `MAX_PIXELS`.
[INFO:swift] Setting max_ratio: 200. You can adjust this hyperparameter through the environment variable: `MAX_RATIO`.
[INFO:swift] Setting video_min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `VIDEO_MIN_PIXELS`.
[INFO:swift] Using environment variable `VIDEO_MAX_PIXELS`, Setting video_max_pixels: 100352.
[INFO:swift] Using environment variable `VIDEO_TOTAL_PIXELS`, Setting video_total_pixels: 90316800.
[INFO:swift] Setting frame_factor: 2. You can adjust this hyperparameter through the environment variable: `FRAME_FACTOR`.
[INFO:swift] Setting fps: 2.0. You can adjust this hyper

[INFO:swift] Loading the model using model_dir: /root/.cache/huggingface/hub/models--SimulaMet--SoccerChat-qwen2-vl-7b/snapshots/ccbfced151749232f6f538417d0437509d362a28
[INFO:swift] Create the default_template for the infer_engine
[INFO:swift] default_system: 'You are a helpful assistant.'
[INFO:swift] max_length: 32768
[INFO:swift] response_prefix: ''
[INFO:swift] agent_template: hermes
[INFO:swift] norm_bbox: norm1000
[INFO:swift] Setting ROOT_IMAGE_DIR: None. You can adjust this hyperparameter through the environment variable: `ROOT_IMAGE_DIR`.
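
To see why 4-bit loading matters on a T4 (16 GB of memory), a back-of-the-envelope estimate of weight memory is enough. The sketch below is only that: `NUM_PARAMS` is the nominal "7B" from the model name, not a measured parameter count, and the estimate ignores quantization constants, any layers left unquantized, activations, and the KV cache.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: nominal parameter count taken from the "7B" in the model name.
NUM_PARAMS = 7e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # unquantized half precision
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4-bit NF4 weights

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f" 4-bit weights: ~{nf4_gib:.1f} GiB")
```

The half-precision figure alone comes close to the card's capacity before any activations are allocated, which is consistent with the notebook's choice to quantize for Colab.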


In [4]:
req_cfg = RequestConfig(max_tokens=512, temperature=0.3, top_k=20, top_p=0.7, repetition_penalty=1.05)

infer_requests = [
    InferRequest(messages=[{
        "role": "user",
        "content": [
            {"type": "video", "video": "https://huggingface.co/datasets/SimulaMet/SoccerChat/resolve/main/videos/MultipleEvents/100037_Shotsontarget--Balloutofplay.mp4"},
            {"type": "text", "text": "What is shown in the video?"},
        ],
    }])
]
resp = engine.infer(infer_requests, req_cfg)
print(resp[0].choices[0].message.content)

qwen-vl-utils using decord to read video.
[INFO:swift] Successfully registered `/usr/local/lib/python3.12/dist-packages/swift/llm/dataset/data/dataset_info.json`.


The video shows a player from the white-jerseyed team attempting to make a pass into the box, but the defense of the blue/red stripe jersey team reacts quickly and clears the ball out of play, resulting in the white-jerseyed team being awarded a corner kick.
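
To ask several questions about a clip, the message structure used in the request above can be factored into a small helper. This is a minimal sketch in plain Python: `build_video_messages` is a hypothetical helper name, and the second question is an illustrative example, not one taken from the dataset. The resulting lists are meant to be wrapped in `InferRequest(messages=...)` exactly as in the cell above.

```python
def build_video_messages(video_url: str, question: str) -> list[dict]:
    """Return a single-turn chat: one user message holding a video and a text question."""
    return [{
        "role": "user",
        "content": [
            {"type": "video", "video": video_url},
            {"type": "text", "text": question},
        ],
    }]


VIDEO_URL = (
    "https://huggingface.co/datasets/SimulaMet/SoccerChat/resolve/main/videos/"
    "MultipleEvents/100037_Shotsontarget--Balloutofplay.mp4"
)
questions = [
    "What is shown in the video?",
    "Which team is awarded the corner kick?",  # hypothetical follow-up question
]

batch = [build_video_messages(VIDEO_URL, q) for q in questions]

# In the notebook, with the engine and req_cfg defined above:
#   infer_requests = [InferRequest(messages=m) for m in batch]
#   for r in engine.infer(infer_requests, req_cfg):
#       print(r.choices[0].message.content)
```

Each question gets its own request, so the answers are independent of one another rather than turns in one conversation.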


In [6]:
from IPython.display import HTML
# Embed the same clip that was sent to the model so the answer can be checked by eye.
HTML("""<video src="https://huggingface.co/datasets/SimulaMet/SoccerChat/resolve/main/videos/MultipleEvents/100037_Shotsontarget--Balloutofplay.mp4" width="500" controls></video>""")