# Visualize

#### Recommended: Just open in Colab
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1AP09NW6a4gz_InGDx1fXiSZhGQHCVFLw#scrollTo=8K6d2sj0ne6R)

In this notebook, we visualize the prediction score for each video frame, that is, the similarity between the frame and a text query. The plots show how the video search system behaves as it looks for the frames that best match the query.

The contents are:
1. Installation
2. Import Libraries, Load Model, Download Data
3. Inference Function
4. Visualization

## 1. Installation

In [1]:
!pip install gradio
!pip install git+https://github.com/openai/CLIP.git
# !pip install git+https://github.com/salesforce/LAVIS.git
!pip install ftfy
!pip install regex 
!pip install tqdm
!pip install imageio-ffmpeg


## 2. Import Libraries, Load Model, Download Data

In [6]:
# Import Libraries
import os
# Log the installed package versions
os.system("pip freeze")
import cv2
from PIL import Image
import clip
import torch
import math
import numpy as np
import datetime

# Load the OpenAI CLIP model (ViT-B/32)
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Download Data
!gdown https://drive.google.com/uc?id=1lcSnGJD0ubCZq74rytupfps-qk6f4V4w

Downloading...
From: https://drive.google.com/uc?id=1lcSnGJD0ubCZq74rytupfps-qk6f4V4w
To: /content/car_crash.mp4
100% 511k/511k [00:00<00:00, 146MB/s]


## 3. Inference Function

In [7]:
def inference_text(video, text):
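  # The frame images will be stored in video_frames
  video_frames = []

  # Open the video file and read its frame rate
  capture = cv2.VideoCapture(video)
  fps = capture.get(cv2.CAP_PROP_FPS)

  # Read the first frame. Note that this frame is never appended below,
  # so extraction effectively starts at the second frame of the video.
  ret, frame = capture.read()
  while capture.isOpened() and ret:
    ret, frame = capture.read()
    if ret:
      # OpenCV returns BGR; reverse the channel axis to get RGB for PIL
      video_frames.append(Image.fromarray(frame[:, :, ::-1]))

  # Release the video file once all frames have been read
  capture.release()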

  # Print some statistics
  print(f"Frames extracted: {len(video_frames)}")
  
  # You can try tuning the batch size for very large videos, but it should usually be OK
  batch_size = 256
  batches = math.ceil(len(video_frames) / batch_size)
  
  # The encoded features will be stored in video_features (512-dimensional for ViT-B/32)
  video_features = torch.empty([0, 512], dtype=torch.float16).to(device)
  
  # Process each batch
  for i in range(batches):
    print(f"Processing batch {i+1}/{batches}")
  
    # Get the relevant frames
    batch_frames = video_frames[i*batch_size : (i+1)*batch_size]
    
    # Preprocess the images for the batch
    batch_preprocessed = torch.stack([preprocess(frame) for frame in batch_frames]).to(device)
    
    # Encode with CLIP and normalize
    with torch.no_grad():
      batch_features = model.encode_image(batch_preprocessed)
      batch_features /= batch_features.norm(dim=-1, keepdim=True)
  
    # Append the batch to the list containing all features
    video_features = torch.cat((video_features, batch_features))
  
  # Print some stats
  print(f"Features: {video_features.shape}")
 
  search_query=text
  display_heatmap=False
  display_results_count=1
  # Encode and normalize the search query using CLIP
  with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(search_query).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

  # Compute the cosine similarity between the search query and each frame.
  # Both sides are unit-normalized, so a dot product suffices (scaled by 100).
  similarities = (100.0 * video_features @ text_features.T)
  values, best_photo_idx = similarities.topk(display_results_count, dim=0)
  print("Values: ", values)
  print("Best photo idx: ", best_photo_idx)

  for frame_id in best_photo_idx:
    frame = video_frames[frame_id]
    # Find the timestamp in the video and display it
    seconds = round(frame_id.cpu().numpy()[0]/fps)
  return frame,f"Found at {str(datetime.timedelta(seconds=seconds))}", similarities, fps
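
The scoring step inside `inference_text` normalizes both embeddings and takes their dot product, scaled by 100. The following is a minimal pure-Python sketch of that idea on toy vectors. The helper names `normalize` and `scaled_cosine` and the 3-dimensional vectors are hypothetical and only for illustration; the notebook itself works on 512-dimensional CLIP features with PyTorch.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def scaled_cosine(frame_vec, text_vec, scale=100.0):
    """Dot product of two unit vectors, multiplied by a display scale."""
    f = normalize(frame_vec)
    t = normalize(text_vec)
    return scale * sum(a * b for a, b in zip(f, t))

# Toy 3-dimensional "embeddings" (hypothetical values, not CLIP outputs)
text_vec = [1.0, 0.0, 0.0]
frame_vecs = [[0.0, 1.0, 0.0], [3.0, 4.0, 0.0], [2.0, 0.0, 0.0]]

# One score per frame; the best match is the frame with the highest score
scores = [scaled_cosine(f, text_vec) for f in frame_vecs]
best_idx = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_idx)
```

Because the vectors are normalized first, the length of an embedding does not affect the score; only its direction relative to the query does.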

## 4. Visualization

### 4.1. Visualize Prediction Value for Each Frame

In [8]:
video = '/content/car_crash.mp4'
text = 'vehicle crash'

video_clip_result = video.replace('.mp4', '_clip.mp4')
frame, seconds, similarities, fps = inference_text(video, text)

Frames extracted: 417
Processing batch 1/2
Processing batch 2/2
Features: torch.Size([417, 512])
Values:  tensor([[31.7812]], device='cuda:0', dtype=torch.float16)
Best photo idx:  tensor([[186]], device='cuda:0')
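
The numbers in this log can be checked by hand. The sketch below recomputes the batch count and the timestamp of the best frame from the logged values. The frame rate of 30.0 is taken from the value printed in section 4.3; the variable names are only for illustration.

```python
import datetime
import math

frame_count = 417   # "Frames extracted" in the log
batch_size = 256    # value used inside inference_text
best_frame = 186    # "Best photo idx" in the log
fps = 30.0          # frame rate printed in section 4.3

# Number of batches and the [start, end) frame range covered by each one
num_batches = math.ceil(frame_count / batch_size)
batch_ranges = [
    (i * batch_size, min((i + 1) * batch_size, frame_count))
    for i in range(num_batches)
]

# Timestamp of the best frame, rounded to whole seconds as in inference_text
seconds = round(best_frame / fps)
timestamp = str(datetime.timedelta(seconds=seconds))

print(num_batches, batch_ranges, timestamp)
```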


In [9]:
import pandas as pd
import plotly.express as px

similarities_list = (similarities.flatten()).tolist()
frames_list = list(range(len(similarities_list)))

df = pd.DataFrame({'Frames':frames_list, 'Similarities':similarities_list})
fig = px.line(df, x='Frames', y='Similarities', title="Similarities over Frame")
fig.update_layout(
    font=dict(size=20)
)
fig.show()

### 4.2. Visualize Similarity over Seconds

In [10]:
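# Average the similarity over consecutive, non-overlapping windows of 10 frames
window = 10
ten_frames_list = []
i = window
# Use <= so that a final full window is not dropped
while i <= len(similarities_list):
  list_10 = similarities_list[i-window:i]
  ten_frames_list.append(sum(list_10)/window)
  i += window

# Each window spans 10 frames, i.e. 10/fps seconds (fps is rounded to an integer)
factor = int(round(fps))/10
seconds_list = [i/factor for i in range(0,len(ten_frames_list))]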

df = pd.DataFrame({'Seconds':seconds_list, 'Similarities':ten_frames_list})
fig = px.line(df, x='Seconds', y='Similarities', title="Similarities over Seconds")
fig.update_layout(
    font=dict(size=20)
)
fig.show()
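
The smoothing used in this section, averaging every 10 consecutive frames and mapping each window to its start time, can be expressed as two small helpers. This is a sketch under stated assumptions: `window_means` and `window_start_seconds` are hypothetical names, a trailing partial window is dropped, and the time axis uses the exact frame rate rather than the rounded one used in the cell.

```python
def window_means(values, window=10):
    """Average consecutive, non-overlapping windows of `window` values.

    A trailing partial window (fewer than `window` values) is dropped.
    """
    last_start = len(values) - window
    return [
        sum(values[start:start + window]) / window
        for start in range(0, last_start + 1, window)
    ]

def window_start_seconds(num_windows, fps, window=10):
    """Start time in seconds of each window, given the frame rate."""
    return [k * window / fps for k in range(num_windows)]

# Toy similarity scores (hypothetical values): 25 frames give 2 full windows
toy_scores = [float(v) for v in range(25)]
means = window_means(toy_scores)
starts = window_start_seconds(len(means), fps=30.0)
print(means, starts)
```

A larger `window` gives a smoother curve at the cost of time resolution; at 30 frames per second a window of 10 frames corresponds to one third of a second.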

### 4.3. Get the Clip where Query Happens

In [11]:
# Locate the best-matching frame
frame_max = int(torch.argmax(similarities.flatten()))
print(frame_max)
print(fps)

# Timestamp of the best frame and length of the video, in whole seconds
center_time = int(round(frame_max/fps))
video_length = int(round(len(similarities)/fps))

# Take 3 seconds on either side, clamped to the bounds of the video
start_time = max(center_time - 3, 0)
end_time = min(center_time + 3, video_length)

print(start_time)
print(end_time)

186
30.0
3
9
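
The window computation above can be wrapped in a reusable function. The sketch below reproduces the printed result for this video and shows how the clamping behaves near the start and end. The name `clip_bounds` and its `margin` parameter are hypothetical, introduced only for this example.

```python
def clip_bounds(best_frame, total_frames, fps, margin=3):
    """Return (start, end) in whole seconds around the best frame.

    The window extends `margin` seconds on either side of the best frame
    and is clamped to the range [0, video length].
    """
    center = int(round(best_frame / fps))
    duration = int(round(total_frames / fps))
    start = max(center - margin, 0)
    end = min(center + margin, duration)
    return start, end

# Values from this notebook: best frame 186 out of 417 frames at 30 fps
print(clip_bounds(186, 417, 30.0))
# A match near the start or the end yields a shorter, clamped window
print(clip_bounds(10, 417, 30.0))
print(clip_bounds(410, 417, 30.0))
```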


In [12]:
from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip
ffmpeg_extract_subclip(video, start_time, end_time, targetname=video_clip_result)


[MoviePy] Running:
>>> /usr/local/lib/python3.7/dist-packages/imageio_ffmpeg/binaries/ffmpeg-linux64-v4.2.2 -y -i /content/car_crash.mp4 -ss 3.00 -t 6.00 -vcodec copy -acodec copy /content/car_crash_clip.mp4
... command successful.


In [13]:
from IPython.display import HTML
from base64 import b64encode
mp4 = open(video_clip_result,'rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=400 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)