# Video Search

## Step 1: Video library (10 points)

My Python API for downloading a video and its closed captions from YouTube is defined in my downloader.py module. To use it:
- import the module
- call the download method within the module

This will result in the video and closed captions being downloaded to a directory called "downloads" in your working directory as an mp4 and text file.

In [24]:
# uses API to download videos and their closed captions
import downloader
urls = ["https://www.youtube.com/watch?v=wbWRWeVe1XE", "https://www.youtube.com/watch?v=FlJoBhLnqko", "https://www.youtube.com/watch?v=Y-bVwPRy_no"]
for url in urls:
    downloader.download(url)

## Step 2: Video indexing pipeline (90 points)

### 2.1 Preprocess the video (15 points)

In [25]:
import cv2
import os
# function to handle preprocessing steps
def preprocess(filepath):
    if not os.path.exists("frames"):
        os.makedirs("frames")
    video = cv2.VideoCapture(filepath)
    videoId = filepath.split("/")[-1].split(".")[0]
    extractedFrames = 10
    frameCount = 0
    totalFrames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    skipFactor = int(totalFrames / extractedFrames)
    # decoding frames
    while True:
        success, frame = video.read()
        if not success: break
        frameCount += 1
        # sampling frames
        if frameCount % skipFactor != 0:
            continue
        # resizing frames
        frame = cv2.resize(frame, (800, 800))
        frameNum = frameCount // skipFactor
        timestamp = video.get(cv2.CAP_PROP_POS_MSEC)
        cv2.imwrite("frames/" + videoId + "#" + str(frameNum) + "#" + str(timestamp) + ".jpg", frame)
    video.release()
preprocess("downloads/wbWRWeVe1XE.mp4")
preprocess("downloads/Y-bVwPRy_no.mp4")
preprocess("downloads/FlJoBhLnqko.mp4")

Note that preprocessing step of normalizing is done in the next section.

### 2.2 Detecting objects (25 points)

I found a pretrained object-detector to detect objects belonging to MS COCO classes from the [this PyTorch documentation page](https://pytorch.org/vision/main/models/generated/torchvision.models.detection.retinanet_resnet50_fpn.html#torchvision.models.detection.RetinaNet_ResNet50_FPN_Weights). I chose this as it can be accessed via a simple import statement as opposed to having to download the model to my working directory.

In [26]:
from torchvision.models.detection import retinanet_resnet50_fpn
model = retinanet_resnet50_fpn(weights='COCO_V1')
model.eval()

RetinaNet(
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): FrozenBatchNorm2d(64, eps=0.0)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): FrozenBatchNorm2d(64, eps=0.0)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): FrozenBatchNorm2d(64, eps=0.0)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): FrozenBatchNorm2d(256, eps=0.0)
          (relu): ReLU(inplace=True)
          (downsample): Sequential(
            (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (1): FrozenBatchNorm2d(256, eps=0.0)


In [27]:
import torch
import torchvision.transforms as T
from PIL import Image
# transforming data to conform to model specifications
transform = T.Compose([
    T.ToTensor(),
    # normalization
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# generating bounding boxes for videos
predictions = []
image_paths = ["frames/" + filename for filename in os.listdir("frames")]
for path in image_paths:
    image = Image.open(path)
    transformed_image = transform(image).unsqueeze(0)
    with torch.no_grad():
        predictions.append(model(transformed_image))

In [28]:
id_to_class = {
    1: 'person',
    2: 'bicycle',
    3: 'car',
    4: 'motorbike',
    5: 'aeroplane',
    6: 'bus',
    7: 'train',
    8: 'truck',
    9: 'boat',
    10: 'traffic light',
    11: 'fire hydrant',
    12: 'street sign',
    13: 'stop sign',
    14: 'parking meter',
    15: 'bench',
    16: 'bird',
    17: 'cat',
    18: 'dog',
    19: 'horse',
    20: 'sheep',
    21: 'cow',
    22: 'elephant',
    23: 'bear',
    24: 'zebra',
    25: 'giraffe',
    26: 'hat',
    27: 'backpack',
    28: 'umbrella',
    29: 'shoe',
    30: 'eyeglasses',
    31: 'handbag',
    32: 'tie',
    33: 'suitcase',
    34: 'frisbee',
    35: 'skis',
    36: 'snowboard',
    37: 'sports ball',
    38: 'kite',
    39: 'baseball bat',
    40: 'baseball glove',
    41: 'skateboard',
    42: 'surfboard',
    43: 'tennis racket',
    44: 'bottle',
    45: 'plate',
    46: 'wine glass',
    47: 'cup',
    48: 'fork',
    49: 'knife',
    50: 'spoon',
    51: 'bowl',
    52: 'banana',
    53: 'apple',
    54: 'sandwich',
    55: 'orange',
    56: 'broccoli',
    57: 'carrot',
    58: 'hot dog',
    59: 'pizza',
    60: 'donut',
    61: 'cake',
    62: 'chair',
    63: 'sofa',
    64: 'potted plant',
    65: 'bed',
    66: 'mirror',
    67: 'dining table',
    68: 'window',
    69: 'desk',
    70: 'toilet',
    71: 'door',
    72: 'tv monitor',
    73: 'laptop',
    74: 'mouse',
    75: 'remote',
    76: 'keyboard',
    77: 'cell phone',
    78: 'microwave',
    79: 'oven',
    80: 'toaster',
    81: 'sink',
    82: 'refrigerator',
    83: 'blender',
    84: 'book',
    85: 'clock',
    86: 'vase',
    87: 'scissors',
    88: 'teddy bear',
    89: 'hair drier',
    90: 'toothbrush',
    91: 'hairbrush'
}

In [29]:
# creating tabular structure for data
table = []
for i, prediction in enumerate(predictions):
    d = prediction[0]
    bounding_boxes = d["boxes"]
    scores = d["scores"]
    labels = d["labels"]
    videoTitle = image_paths[i].split("/")[1].split(".")[0]
    try:
        videoId, frameNum, timestamp = videoTitle.split("#")
    except:
        print(videoTitle)
    for j, box in enumerate(bounding_boxes):
        detectedObjId = int(labels[j])
        detectedObjClass = id_to_class[detectedObjId]
        confidence = float(scores[j])
        bbox_info = bounding_boxes[j]
        table.append([videoId, frameNum, timestamp, detectedObjId, detectedObjClass, confidence, bbox_info])

In [36]:
import pandas as pd
df = pd.DataFrame(table, columns=['videoId', 'frameNum', 'timestamp', 'detectedObjId', 'detectedObjClass', 'confidence', 'bbox_info'])
print(df.head())

       videoId frameNum timestamp  detectedObjId detectedObjClass  confidence  \
0  FlJoBhLnqko        1     22856             28         umbrella    0.230096   
1  FlJoBhLnqko        1     22856              9             boat    0.209400   
2  FlJoBhLnqko        1     22856             28         umbrella    0.205693   
3  FlJoBhLnqko        1     22856             64     potted plant    0.168514   
4  FlJoBhLnqko        1     22856             62            chair    0.163283   

                                           bbox_info  
0  [tensor(15.5198), tensor(474.2111), tensor(117...  
1  [tensor(12.3550), tensor(471.3804), tensor(117...  
2  [tensor(263.1909), tensor(539.4532), tensor(35...  
3  [tensor(401.7110), tensor(506.4498), tensor(48...  
4  [tensor(703.0315), tensor(449.7769), tensor(79...  


In [30]:
def extract_subimages(image, bounding_boxes):
    subimages = []
    for bbox in bounding_boxes:
        x, y, w, h = map(int, bbox)
        subimage = image[y:y+h, x:x+w]
        subimages.append(subimage)
    return subimages

In [31]:
if not os.path.exists("subimages"):
        os.makedirs("subimages")
for i, prediction in enumerate(predictions):
    d = prediction[0]
    bounding_boxes = d["boxes"]
    scores = d["scores"]
    labels = d["labels"]
    subimages = extract_subimages(cv2.imread(image_paths[i]), bounding_boxes)
    for j, subimage in enumerate(subimages):
        print(scores[j])
        title = "subimages/" + str(i) + "-" + str(j) + "-" + str(scores[j]) + "-" + str(labels[j]) + ".jpg"
        cv2.imwrite(title, subimage)

tensor(0.2301)
tensor(0.2094)
tensor(0.2057)
tensor(0.1685)
tensor(0.1633)
tensor(0.1525)
tensor(0.1467)
tensor(0.1453)
tensor(0.1452)
tensor(0.1430)
tensor(0.1404)
tensor(0.1404)
tensor(0.1353)
tensor(0.1300)
tensor(0.1211)
tensor(0.1166)
tensor(0.1161)
tensor(0.1159)
tensor(0.1151)
tensor(0.1125)
tensor(0.1116)
tensor(0.1098)
tensor(0.1091)
tensor(0.1087)
tensor(0.1078)
tensor(0.1047)
tensor(0.1040)
tensor(0.1037)
tensor(0.1031)
tensor(0.1005)
tensor(0.1002)
tensor(0.0958)
tensor(0.0945)
tensor(0.0919)
tensor(0.0904)
tensor(0.0902)
tensor(0.0889)
tensor(0.0878)
tensor(0.0875)
tensor(0.0869)
tensor(0.0866)
tensor(0.0865)
tensor(0.0863)
tensor(0.0861)
tensor(0.0860)
tensor(0.0845)
tensor(0.0845)
tensor(0.0844)
tensor(0.0824)
tensor(0.0804)
tensor(0.0804)
tensor(0.0799)
tensor(0.0797)
tensor(0.0795)
tensor(0.0792)
tensor(0.0785)
tensor(0.0772)
tensor(0.0762)
tensor(0.0760)
tensor(0.0753)
tensor(0.0750)
tensor(0.0750)
tensor(0.0747)
tensor(0.0737)
tensor(0.0732)
tensor(0.0728)
tensor(0.0

tensor(0.0790)
tensor(0.0788)
tensor(0.0787)
tensor(0.0782)
tensor(0.0775)
tensor(0.0765)
tensor(0.0751)
tensor(0.0750)
tensor(0.0734)
tensor(0.0720)
tensor(0.0719)
tensor(0.0717)
tensor(0.0717)
tensor(0.0710)
tensor(0.0704)
tensor(0.0697)
tensor(0.0691)
tensor(0.0691)
tensor(0.0690)
tensor(0.0687)
tensor(0.0687)
tensor(0.0687)
tensor(0.0683)
tensor(0.0679)
tensor(0.0676)
tensor(0.0673)
tensor(0.0671)
tensor(0.0667)
tensor(0.0664)
tensor(0.0659)
tensor(0.0657)
tensor(0.0649)
tensor(0.0641)
tensor(0.0641)
tensor(0.0638)
tensor(0.0636)
tensor(0.0636)
tensor(0.0634)
tensor(0.0634)
tensor(0.0633)
tensor(0.0631)
tensor(0.0630)
tensor(0.0624)
tensor(0.0624)
tensor(0.0622)
tensor(0.0618)
tensor(0.0611)
tensor(0.0606)
tensor(0.0604)
tensor(0.0600)
tensor(0.0600)
tensor(0.0596)
tensor(0.0591)
tensor(0.0589)
tensor(0.0588)
tensor(0.0585)
tensor(0.0584)
tensor(0.0583)
tensor(0.0582)
tensor(0.0579)
tensor(0.0575)
tensor(0.0574)
tensor(0.0569)
tensor(0.0569)
tensor(0.0564)
tensor(0.0562)
tensor(0.0

tensor(0.0536)
tensor(0.0534)
tensor(0.0533)
tensor(0.0531)
tensor(0.0527)
tensor(0.0526)
tensor(0.0526)
tensor(0.0525)
tensor(0.0524)
tensor(0.0522)
tensor(0.0521)
tensor(0.0519)
tensor(0.0518)
tensor(0.0516)
tensor(0.0513)
tensor(0.0506)
tensor(0.0506)
tensor(0.0502)
tensor(0.0502)
tensor(0.0501)
tensor(0.0501)
tensor(0.0500)
tensor(0.4959)
tensor(0.3217)
tensor(0.2857)
tensor(0.1951)
tensor(0.1822)
tensor(0.1777)
tensor(0.1749)
tensor(0.1724)
tensor(0.1713)
tensor(0.1677)
tensor(0.1677)
tensor(0.1644)
tensor(0.1640)
tensor(0.1565)
tensor(0.1536)
tensor(0.1476)
tensor(0.1435)
tensor(0.1336)
tensor(0.1295)
tensor(0.1271)
tensor(0.1258)
tensor(0.1256)
tensor(0.1245)
tensor(0.1236)
tensor(0.1233)
tensor(0.1219)
tensor(0.1198)
tensor(0.1191)
tensor(0.1188)
tensor(0.1158)
tensor(0.1146)
tensor(0.1106)
tensor(0.1095)
tensor(0.1089)
tensor(0.1087)
tensor(0.1083)
tensor(0.1057)
tensor(0.1042)
tensor(0.1035)
tensor(0.1002)
tensor(0.0978)
tensor(0.0975)
tensor(0.0975)
tensor(0.0970)
tensor(0.0

tensor(0.0754)
tensor(0.0748)
tensor(0.0735)
tensor(0.0729)
tensor(0.0727)
tensor(0.0725)
tensor(0.0724)
tensor(0.0721)
tensor(0.0712)
tensor(0.0708)
tensor(0.0706)
tensor(0.0705)
tensor(0.0703)
tensor(0.0703)
tensor(0.0672)
tensor(0.0668)
tensor(0.0664)
tensor(0.0661)
tensor(0.0660)
tensor(0.0657)
tensor(0.0656)
tensor(0.0651)
tensor(0.0647)
tensor(0.0641)
tensor(0.0641)
tensor(0.0641)
tensor(0.0640)
tensor(0.0635)
tensor(0.0632)
tensor(0.0629)
tensor(0.0626)
tensor(0.0620)
tensor(0.0616)
tensor(0.0610)
tensor(0.0607)
tensor(0.0604)
tensor(0.0604)
tensor(0.0599)
tensor(0.0599)
tensor(0.0595)
tensor(0.0590)
tensor(0.0585)
tensor(0.0574)
tensor(0.0570)
tensor(0.0568)
tensor(0.0567)
tensor(0.0561)
tensor(0.0560)
tensor(0.0557)
tensor(0.0556)
tensor(0.0555)
tensor(0.0552)
tensor(0.0551)
tensor(0.0540)
tensor(0.0539)
tensor(0.0537)
tensor(0.0534)
tensor(0.0532)
tensor(0.0531)
tensor(0.0528)
tensor(0.0527)
tensor(0.0522)
tensor(0.0522)
tensor(0.0519)
tensor(0.0507)
tensor(0.0506)
tensor(0.0

tensor(0.0790)
tensor(0.0787)
tensor(0.0783)
tensor(0.0778)
tensor(0.0775)
tensor(0.0773)
tensor(0.0772)
tensor(0.0772)
tensor(0.0771)
tensor(0.0768)
tensor(0.0767)
tensor(0.0765)
tensor(0.0764)
tensor(0.0764)
tensor(0.0762)
tensor(0.0761)
tensor(0.0759)
tensor(0.0757)
tensor(0.0755)
tensor(0.0753)
tensor(0.0748)
tensor(0.0736)
tensor(0.0734)
tensor(0.0724)
tensor(0.0722)
tensor(0.0718)
tensor(0.0712)
tensor(0.0710)
tensor(0.0707)
tensor(0.0707)
tensor(0.0702)
tensor(0.0702)
tensor(0.0701)
tensor(0.0700)
tensor(0.0700)
tensor(0.0697)
tensor(0.0697)
tensor(0.0697)
tensor(0.0688)
tensor(0.0685)
tensor(0.0680)
tensor(0.0679)
tensor(0.0676)
tensor(0.0674)
tensor(0.0674)
tensor(0.0673)
tensor(0.0670)
tensor(0.0667)
tensor(0.0666)
tensor(0.0663)
tensor(0.0661)
tensor(0.0661)
tensor(0.0657)
tensor(0.0653)
tensor(0.0653)
tensor(0.0652)
tensor(0.0646)
tensor(0.0643)
tensor(0.0642)
tensor(0.0639)
tensor(0.0638)
tensor(0.0637)
tensor(0.0633)
tensor(0.0631)
tensor(0.0630)
tensor(0.0629)
tensor(0.0

tensor(0.0551)
tensor(0.0549)
tensor(0.0547)
tensor(0.0547)
tensor(0.0546)
tensor(0.0546)
tensor(0.0546)
tensor(0.0544)
tensor(0.0541)
tensor(0.0540)
tensor(0.0538)
tensor(0.0533)
tensor(0.0533)
tensor(0.0533)
tensor(0.0524)
tensor(0.0524)
tensor(0.0522)
tensor(0.0520)
tensor(0.0519)
tensor(0.0514)
tensor(0.0513)
tensor(0.0513)
tensor(0.0510)
tensor(0.0510)
tensor(0.0507)
tensor(0.0505)
tensor(0.0502)


## 2.3 Embedding model (30 points)

In [32]:
data = []
for image in os.listdir("subimages"):
    inp = cv2.resize(cv2.imread("subimages/" + image), (28, 28))
    data.append(inp)

In [33]:
import numpy as np
from sklearn.model_selection import train_test_split
x_train = np.array(data) / 255.0

In [34]:
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model

# Encoder
input_img = Input(shape=(28, 28, 3))
x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

# Decoder
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(3, (3, 3), activation='sigmoid', padding='same')(x)

# Autoencoder
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

In [38]:
history = autoencoder.fit(x_train, x_train,
                           epochs=50,
                           batch_size=32,
                           shuffle=True,
                           validation_split=0.2)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
