<a href="https://colab.research.google.com/github/ysurs/Fun_projects/blob/main/Coding_challenge_clip_a_video.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installing dependencies and libraries

In [None]:
# pytube is used to download videos from YouTube
!pip install pytube

# Install a newer version of plotly
!pip install plotly==4.14.3

# Install CLIP from the GitHub repo
!pip install git+https://github.com/openai/CLIP.git

# Install torch 1.7.1 with GPU support
!pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-ey71nq21
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-ey71nq21
  Resolved https://github.com/openai/CLIP.git to commit a9b1bf5920416aaeaec965c25dd9e8f98c864f16
  Preparing metadata (setup.py) ... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [None]:
from pytube import YouTube
import cv2
from PIL import Image
import clip
import torch
import math
import numpy as np
import pandas as pd
import plotly.express as px
import datetime
from IPython.core.display import HTML
import time
import os

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Clip a video API

In [None]:
class clip_a_video:


  def __init__(self,dataset_path,video_download_path,nframes_skip):
    self.dataset=pd.read_csv(dataset_path)[:8]
    self.video_download_path=video_download_path
    self.nframes_skip=nframes_skip

  
  def get_video_frames(self,video_path):
    
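    # Build the YouTube object once and reuse it for the id and the streams
    youtube = YouTube(video_path)
    video_id = youtube.video_id
    streams = youtube.streams.filter(adaptive=True, subtype="mp4", resolution="360p", only_video=True)
    #print(video_id)
    
    if len(streams) == 0:
      # Raising a plain string is a TypeError in Python 3, so raise a real exception
      raise ValueError("No suitable stream found for this YouTube video!")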

    file_name = f"{video_id}.mp4"
    file_path = os.path.join(self.video_download_path, file_name)
    streams[0].download(filename=file_name, output_path=self.video_download_path)
  

    video_frames = []

    capture = cv2.VideoCapture(file_path)
    fps = capture.get(cv2.CAP_PROP_FPS)

    current_frame = 0
    while capture.isOpened():

      ret, frame = capture.read()
      
      
      if ret:
        video_frames.append(Image.fromarray(frame[:, :, ::-1]))
      else:
        break

      
      current_frame += self.nframes_skip
      capture.set(cv2.CAP_PROP_POS_FRAMES, current_frame)
      
    capture.release()
    return video_frames



  
  def get_video_features(self,vid_frames,model,preprocess,device):
    
    #print("in features function")
    batch_size = 256
    batches = math.ceil(len(vid_frames) / batch_size)

    
    video_features = torch.empty([0, 512], dtype=torch.float16).to(device)

    
    for i in range(batches):
      

      
      batch_frames = vid_frames[i*batch_size : (i+1)*batch_size]
      #print(len(batch_frames))
      
      
      batch_preprocessed = torch.stack([preprocess(frame) for frame in batch_frames]).to(device)
      #print(batch_preprocessed.shape)
      
    
      with torch.no_grad():
        batch_features = model.encode_image(batch_preprocessed)
        #print(batch_features.shape)
        batch_features /= batch_features.norm(dim=-1, keepdim=True)
        #print(batch_features.shape)
      
      video_features = torch.cat((video_features, batch_features))

    return video_features
  
  
  
  def check_query_in_frames(self,query,video_features,model_for_task,preprocess_pipeline,device,display_results_count=3):

    #print("in core function")
    
    with torch.no_grad():
      text_features = model_for_task.encode_text(clip.tokenize(query).to(device))
      text_features /= text_features.norm(dim=-1, keepdim=True)
      #print(text_features.shape)

    # Compute the similarity between the search query and each frame using the Cosine similarity
    similarities = (100.0 * video_features @ text_features.T)
    values, best_photo_idx = similarities.topk(display_results_count, dim=0)
    #print(values,best_photo_idx)
    return best_photo_idx[values>28.0].numel()
    #print(values,best_photo_idx)
  
  
  
  
  
  def is_query_in_video(self,youtube_video,query):
    
    
    full_video_path="https://youtube.com/"+youtube_video
    #print(full_video_path)
    current_video_frames=self.get_video_frames(full_video_path)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model_for_task, preprocess_pipeline = clip.load("ViT-B/32", device=device)

    video_features=self.get_video_features(current_video_frames,model_for_task,preprocess_pipeline,device)
  
    no_of_related_frames=self.check_query_in_frames(query,video_features,model_for_task,preprocess_pipeline,device)
  
    return no_of_related_frames
  
  
  
  
  
  
  
  
  
  
  def find_videos_with_query(self,search_text):

    videos_index=[]
    
    for i in range(len(self.dataset)):
        print(i)
        start_time = time.time()
        if(self.is_query_in_video(self.dataset['Videourl'][i],search_text)):
            videos_index.append(i)
            print(i)
        end_time = time.time()
        total_time=end_time-start_time
        #print(f"time taken to process 1 video {total_time}")

    return self.dataset.loc[videos_index]
  


  



*****

### Using top 2 rows

#### The search query was "a cow". Of the two videos, the class returned only the first one. I manually verified that a cow is present in the first video.


Note: For clarity, I added print statements to my functions for this implementation

In [None]:
tet_object=clip_a_video('/content/drive/MyDrive/Copy of Youtube_Video_Dataset.csv','/content/downloaded_videos',60)

In [None]:
tet_object.dataset

Unnamed: 0,Title,Videourl,Category,Description
0,Madagascar Street Food!!! Super RARE Malagasy ...,/watch?v=EwBA1fOQ96c,Food,🎥GIANT ALIEN SNAIL IN JAPAN! » https://youtu.b...
1,42 Foods You Need To Eat Before You Die,/watch?v=0SPwwpruGIA,Food,This is the ultimate must-try food bucket list...


In [None]:

tet_object.find_videos_with_query("a cow")

0
https://youtube.com//watch?v=EwBA1fOQ96c
EwBA1fOQ96c
in features function
in core function
tensor([[28.9985],
        [28.6546],
        [28.4272]]) tensor([[36],
        [38],
        [39]])
0
time taken to process 1 video 266.762371301651
1
https://youtube.com//watch?v=0SPwwpruGIA
0SPwwpruGIA
in features function
in core function
tensor([[26.3702],
        [25.4001],
        [25.1213]]) tensor([[128],
        [  0],
        [ 48]])
time taken to process 1 video 95.22529458999634


Unnamed: 0,Title,Videourl,Category,Description
0,Madagascar Street Food!!! Super RARE Malagasy ...,/watch?v=EwBA1fOQ96c,Food,🎥GIANT ALIEN SNAIL IN JAPAN! » https://youtu.b...


Points to be noted:

* The hyperparameter in this case is the number of frames to skip (`nframes_skip`).
* I have set `nframes_skip` to 60. Because only one frame in every 60 is examined, it is possible that none of the sampled frames contains the object described by the query, even when the video does.
* In the above example, the video which actually had a cow in it had a maximum similarity score (cosine similarity multiplied by 100) of 28.9985. It is possible that, while selecting frames from the video, I skipped the frames in which the cow is most clearly visible.
* In such cases, setting a threshold (a similarity score above which we can assume that the object in the query is actually present in the video) was a challenge.
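
To make the effect of the frame-skip hyperparameter concrete, here is a minimal sketch that converts a skip value into the time gap between sampled frames and into the number of frames kept. The helper names and the 30 fps figure are assumptions for illustration; in the class above the real frame rate comes from `capture.get(cv2.CAP_PROP_FPS)`.

```python
def sampling_interval_seconds(nframes_skip, fps):
    """Seconds of video between two consecutive sampled frames."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return nframes_skip / fps


def sampled_frame_count(total_frames, nframes_skip):
    """Frames kept when reading positions 0, nframes_skip, 2*nframes_skip, ..."""
    if total_frames <= 0:
        return 0
    return (total_frames - 1) // nframes_skip + 1


# Assumed 30 fps video: skipping 60 frames examines one frame every 2 seconds
print(sampling_interval_seconds(60, 30.0))  # 2.0
# A hypothetical 121-frame clip keeps positions 0, 60 and 120
print(sampled_frame_count(121, 60))  # 3
```

Under these assumptions, an object that is on screen for less than two seconds can be missed entirely, which is one reason the best score for the cow video sits so close to the threshold.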

****

****

### Using top 8 rows

In [None]:
tet_object_8=clip_a_video('/content/drive/MyDrive/Copy of Youtube_Video_Dataset.csv','/content/downloaded_videos_8',60)

In [None]:
tet_object_8.dataset.head(8)

Unnamed: 0,Title,Videourl,Category,Description
0,Madagascar Street Food!!! Super RARE Malagasy ...,/watch?v=EwBA1fOQ96c,Food,🎥GIANT ALIEN SNAIL IN JAPAN! » https://youtu.b...
1,42 Foods You Need To Eat Before You Die,/watch?v=0SPwwpruGIA,Food,This is the ultimate must-try food bucket list...
2,Gordon Ramsay’s Top 5 Indian Dishes,/watch?v=upfu5nQB2ks,Food,We found 5 of the best and most interesting In...
3,How To Use Chopsticks - In About A Minute 🍜,/watch?v=xFRzzSF_6gk,Food,You're most likely sitting in a restaurant wit...
4,Trying Indian Food 1st Time!,/watch?v=K79bXtaRwcM,Food,HELP SUPPORT SINSTV!! Shop Our Sponsors!\nLast...
5,Blippi Tours the Chocolate Factory | Learn abo...,/watch?v=uSIb-Wbyx6Y,Food,After Blippi eats his vegetables Blippi takes ...
6,EGYPT: Vegetarian food | Mobile Sim | Indian S...,/watch?v=Gozaqmg6hmk,Food,"http://bit.ly/subscribeMT\nIn this video, you ..."
7,Chinese Street Food Liuhe Tourist Night Market,/watch?v=H0xKYgUX3zI,Food,Trying many different kinds of chinese street ...


In [None]:
# Final dataframe containing all the videos that have at least one frame related to the query.
tet_object_8.find_videos_with_query("chopsticks")

0
0
1
2
2
3
3
4
4
5
6
7
7


Unnamed: 0,Title,Videourl,Category,Description
0,Madagascar Street Food!!! Super RARE Malagasy ...,/watch?v=EwBA1fOQ96c,Food,🎥GIANT ALIEN SNAIL IN JAPAN! » https://youtu.b...
2,Gordon Ramsay’s Top 5 Indian Dishes,/watch?v=upfu5nQB2ks,Food,We found 5 of the best and most interesting In...
3,How To Use Chopsticks - In About A Minute 🍜,/watch?v=xFRzzSF_6gk,Food,You're most likely sitting in a restaurant wit...
4,Trying Indian Food 1st Time!,/watch?v=K79bXtaRwcM,Food,HELP SUPPORT SINSTV!! Shop Our Sponsors!\nLast...
7,Chinese Street Food Liuhe Tourist Night Market,/watch?v=H0xKYgUX3zI,Food,Trying many different kinds of chinese street ...


In [None]:
tet_object_8.dataset['Description'][0]

"🎥GIANT ALIEN SNAIL IN JAPAN! » https://youtu.be/5jcgwu-GvM4\n🇲🇬GO ON YOUR TOUR OF MADAGASCAR! » http://bit.ly/Ramartour\n👕GET YOUR BEST EVER MERCH! »  http://bit.ly/BEFRSMerch\n💗SUPPORT OUR MISSION » http://bit.ly/BestEverPatreon\n🛵Learn more about Onetrip’s Vietnam tours » http://bit.ly/BEFRSOnetrip\n\nSpecial thanks to Joel and Ramartour Madagascar for helping us capture the undiscovered parts of Madagascar. Go on your own tour of Madagascar with Ramartour: http://bit.ly/Ramartour and follow them on Instagram for more Madagascar info: @ramartour_madagascar\n- - - - - - - - - - - - - - - - -\n» THE TRADITIONAL FOOD IN MADAGASCAR\n\n1. TSENA’NY TANTSAHA TALATAMATY: Rice Porridge + Chinese Noodle Soup\nADDRESS: Talatamaty City\nOPEN: 3AM - 7AM (or whenever the police shut it down)\n\nFour miles outside of Madagascar’s capital city, Antananarivo, lies Talatamaty City. Talatamaty City is home to one of the country’s largest local morning markets, Tsena’ny Tantsaha. Sellers open up shop a

Note: Description contains important information about the contents of this video

## Q. How are you indexing videos? How long does it take?

Ans: 

Steps:

1. For each YouTube video, we filter the available streams: adaptive streams only, `mp4` container, 360p resolution, and video-only (`only_video=True`), since audio is not needed for this application.

2. The first matching stream is downloaded, which gives us a video in .mp4 format.

3. We specify the number of frames to skip between samples (`nframes_skip`).

4. We extract the sampled frames and store them in a list. `cv2.VideoCapture(file_path).read()` returns frames in BGR order, so we reverse the channel order before passing them to `Image.fromarray()`.

5. The frames are processed in batches of 256. Each frame in a batch is preprocessed with the `preprocess` function returned by `clip.load`.

6. All the frames in a batch are encoded and normalised, so each frame is represented by a vector of length 512.

7. Concatenating the encodings of all batches gives the `video_features` tensor for the video.
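
The batching and normalisation in steps 5 and 6 can be sketched without CLIP or a GPU. The snippet below is a plain-Python illustration, not the notebook's implementation: `batch_slices` and `l2_normalize` are hypothetical helper names, and lists of floats stand in for tensors.

```python
import math


def batch_slices(n_items, batch_size):
    """Return (start, end) index pairs covering n_items in chunks of batch_size."""
    n_batches = math.ceil(n_items / batch_size)
    return [
        (i * batch_size, min((i + 1) * batch_size, n_items))
        for i in range(n_batches)
    ]


def l2_normalize(vector):
    """Scale a vector to unit length, mirroring features / features.norm()."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


# 600 sampled frames with the notebook's batch size of 256 give three batches
print(batch_slices(600, 256))  # [(0, 256), (256, 512), (512, 600)]
```

Normalising every frame vector to unit length at indexing time is what later allows a plain dot product with the query vector to be read as a cosine similarity.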


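### How long does it take?

It depends on the length of the video. For example, processing the first video in our dataset took 266.76 seconds and the second took 95.23 seconds. These timings cover the whole pipeline for a video: download, frame extraction, model loading, encoding, and the comparison with the query.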

## Q. How does inference work? How performant is v1?

Ans:

Steps:

1. After getting the `video_features` tensor, we encode the query that was passed in, which gives a tensor of shape [1,512]. We normalise this encoding as well.

2. We obtain the similarity between the query and the frames in our video by performing **matrix multiplication**. Let's say our video features tensor is of shape [89,512], i.e. 89 frames, each represented by a vector of length 512.
The shape of the query encoding is [1,512].

matrix multiplication = [89,512] @ [1,512].T, which gives a result of shape [89,1]

Because both sides are normalised, each entry of the result is a cosine similarity. Multiplying the result by 100 gives the **similarity scores**, and we select the top 3 frames.

3. After we get our top 3 similarity scores and the corresponding frames, we apply a **threshold**. In this case, I have set it to 28, chosen by **trial and error**, and we keep all scores greater than 28. **If any frame has a similarity score greater than 28, we conclude that the query is present in the video; otherwise we conclude that it is not.**
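
The three inference steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's code: `count_matching_frames` is a hypothetical name, lists of floats stand in for tensors, and the toy 2-dimensional vectors are already normalised (real CLIP features have length 512 and, as the run above shows, much lower scores).

```python
def count_matching_frames(frame_features, text_feature, top_k=3, threshold=28.0):
    """Count how many of the top_k frame scores exceed the threshold.

    Both inputs are assumed to be L2-normalised already, so each dot
    product equals the cosine similarity between a frame and the query.
    """
    scores = [
        100.0 * sum(f * t for f, t in zip(frame, text_feature))
        for frame in frame_features
    ]
    top_scores = sorted(scores, reverse=True)[:top_k]
    return sum(1 for score in top_scores if score > threshold)


# Toy unit vectors: scores are roughly 100, 60, 0 and 80
frames = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.8, 0.6]]
query = [1.0, 0.0]
print(count_matching_frames(frames, query, threshold=70.0))  # 2
```

A non-zero count is treated as "the query is present in the video", which matches how `find_videos_with_query` uses the return value as a truth value.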

## Q. How to improve?

1. One way is to not skip frames while processing videos. In that case, I expect the best-matching frames to score much higher, so the threshold could be raised (to roughly 70 to 80 percent or more) and we can get more accurate results. The cost is that every frame has to be encoded, so processing time grows accordingly.

2. Before processing all the videos, we can preprocess the description of each video: encode and normalise the description, then find the cosine similarity between the encoding of the description and the query. Only if the similarity is above a threshold (which we will set experimentally) do we download that particular video. This will save us from unnecessary downloads.
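
The second idea amounts to a cheap filter that runs before any download. The sketch below shows the control flow only. Everything in it is hypothetical: `select_candidates` takes the similarity function as a parameter, and `word_overlap` is a simple stand-in used so the example runs without a model; it is not the CLIP-based similarity proposed above. With CLIP itself, long descriptions would probably need truncating to fit the text encoder's context length.

```python
def word_overlap(query, description):
    """Stand-in similarity: fraction of query words found in the description.

    Hypothetical placeholder for an encoder-based cosine similarity;
    swap in a real text-to-text score when one is available.
    """
    query_words = set(query.lower().split())
    description_words = set(description.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & description_words) / len(query_words)


def select_candidates(descriptions, query, similarity=word_overlap, threshold=0.5):
    """Return indices of videos whose description scores at or above the threshold."""
    return [
        index
        for index, description in enumerate(descriptions)
        if similarity(query, description) >= threshold
    ]


# Made-up descriptions: only the first mentions the query word
descriptions = [
    "How to use chopsticks in a restaurant",
    "A tour of the chocolate factory",
]
print(select_candidates(descriptions, "chopsticks"))  # [0]
```

Only the videos at the returned indices would then be passed to `is_query_in_video`. A filter like this trades recall for speed: a video that shows the object without mentioning it in the description would be skipped.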

## Q. How performant is v1?

Ans: Let's say the time taken to process one video and find the similarity between the query and its frames is T on average. In the current implementation, the total time will be L*T, where L is the number of videos in the dataset.


If we adopt the second method from the improvements above, the time taken will be l*T, where l is the number of videos that are more likely to be related to the query (l ≤ L).
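
As a rough worked example of this cost model, the sketch below compares L*T with l*T. The numbers are assumptions chosen for illustration: T = 180 seconds is a made-up average per video, L = 8 matches the eight rows loaded by the class, and l = 3 is a hypothetical number of videos that survive the description filter.

```python
def total_processing_time(n_videos, seconds_per_video):
    """Total time under the simple model: every video costs the same T."""
    return n_videos * seconds_per_video


T = 180.0  # assumed average seconds per video
L = 8      # videos in the dataset slice
l = 3      # hypothetical number of videos kept by the description filter

print(total_processing_time(L, T))  # 1440.0 seconds without filtering
print(total_processing_time(l, T))  # 540.0 seconds with filtering
```

The model ignores the time needed to score the descriptions themselves, which should be small compared with downloading and encoding a video.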