<a href="https://colab.research.google.com/github/tsakailab/sandbox/blob/master/VLM_demo/X_CLIP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# X-CLIP trial
- [arXiv paper](https://arxiv.org/abs/2208.02816)
- [Hugging Face documentation (includes the original .ipynb this notebook is based on)](https://huggingface.co/docs/transformers/en/model_doc/xclip)
- [Microsoft GitHub page](https://github.com/microsoft/VideoX/tree/master/X-CLIP)

In [None]:
!pip install -q decord


## Load video

Here we load a video of people eating spaghetti.

In [None]:
from huggingface_hub import hf_hub_download
from ipywidgets import Video

file_path = hf_hub_download(
    repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
)
Video.from_file(file_path, width=500)



We'll randomly sample 8 frames from the video.

In [None]:
from decord import VideoReader, cpu
import numpy as np

np.random.seed(0)

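def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    """Return `clip_len` frame indices spread evenly over a random window.

    The window spans `clip_len * frame_sample_rate` frames and is placed at a
    random position within the first `seg_len` frames of the video.
    """
    converted_len = int(clip_len * frame_sample_rate)
    # Pick the end of the window at random, then derive its start
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    # Evenly spaced positions; clip so the last index stays inside the window
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices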

vr = VideoReader(file_path, num_threads=1, ctx=cpu(0))

# sample 8 frames
vr.seek(0)
indices = sample_frame_indices(clip_len=8, frame_sample_rate=1, seg_len=len(vr))
video = vr.get_batch(indices).asnumpy()
print(video.shape)

(8, 360, 640, 3)
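To see what the sampler does without loading a video, here is a minimal sketch that runs the same logic on a hypothetical 100-frame clip. The explicit `rng` argument, the 100-frame length, and `frame_sample_rate=4` are assumptions made for illustration; the notebook cell above uses the global NumPy seed and a rate of 1.

```python
import numpy as np

def sample_frame_indices(clip_len, frame_sample_rate, seg_len, rng):
    # Same logic as the notebook cell, but with an explicit random generator
    # (an assumption here, so the sketch is reproducible in isolation).
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = rng.integers(converted_len, seg_len)  # upper bound is exclusive
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    return np.clip(indices, start_idx, end_idx - 1).astype(np.int64)

rng = np.random.default_rng(0)

# Hypothetical video of 100 frames: 8 indices spread over a 32-frame window
toy_indices = sample_frame_indices(clip_len=8, frame_sample_rate=4, seg_len=100, rng=rng)
window = toy_indices[-1] - toy_indices[0]

print(toy_indices)
print(len(toy_indices), window)
```

With a larger `frame_sample_rate` the eight indices cover a longer stretch of the video, at the cost of skipping more frames between samples.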


## Run inference

Finally, we pass the sampled frames and three candidate texts through the X-CLIP model, which scores how well each text matches the video.
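As a rough intuition for what "matching" means, the sketch below assumes CLIP-style scoring: the video and each text are embedded as vectors, compared by scaled cosine similarity, and turned into probabilities with a softmax. The helper `match_probabilities`, the 3-dimensional toy embeddings, and the `scale` value are all hypothetical; the real model produces its own embeddings and learned scale.

```python
import numpy as np

def match_probabilities(video_embed, text_embeds, scale):
    # Hypothetical helper: scaled cosine similarity followed by a softmax.
    v = video_embed / np.linalg.norm(video_embed)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = scale * (t @ v)          # one similarity score per text
    logits = logits - logits.max()    # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Toy embeddings (made up): the second text points almost the same way as the video
video_embed = np.array([0.9, 0.1, 0.0])
text_embeds = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

toy_probs = match_probabilities(video_embed, text_embeds, scale=10.0)
print(toy_probs)
```

The text whose embedding is closest in direction to the video embedding receives the largest share of the probability mass.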

In [None]:
from transformers import XCLIPProcessor, XCLIPModel
import torch

# model_name = "microsoft/xclip-base-patch32" # ViT-B/32
model_name = "microsoft/xclip-large-patch14" # biggest model ViT-L/14

processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)


In [None]:
inputs = processor(
    text=["playing sports", "eating spaghetti", "go shopping"],
    videos=list(video),
    return_tensors="pt",
    padding=True,
)

# forward pass (no gradients needed for inference)
# takes about 30 seconds with ViT-L/14
# and a few seconds with ViT-B/32
with torch.no_grad():
    outputs = model(**inputs)

# softmax over the candidate texts gives one probability per text
probs = outputs.logits_per_video.softmax(dim=1)
probs

Unused or unrecognized kwargs: padding.


tensor([[3.6946e-04, 9.9933e-01, 2.9842e-04]])
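The tensor is easier to read once each probability is paired with its text. The sketch below copies the three values printed above into a plain list (in the notebook, `probs[0].tolist()` would produce the same list) and ranks the candidate texts; the variable names are chosen here for illustration.

```python
labels = ["playing sports", "eating spaghetti", "go shopping"]

# Values copied from the output above
probs_row = [3.6946e-04, 9.9933e-01, 2.9842e-04]

# Rank the candidate texts from most to least likely
ranked = sorted(zip(labels, probs_row), key=lambda pair: pair[1], reverse=True)
best_label, best_prob = ranked[0]

for label, p in ranked:
    print(f"{label:>18}: {p:.4f}")
```

The model assigns almost all of the probability to "eating spaghetti", which matches the content of the video.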