**Assignment 3**

**By Roll Number 2023702013**



Q1: Face detection and association-based tracking [4.5 points]

1. Data preparation. We will implement face detection and tracking on a famous scene from the movie Forrest Gump. To prepare the dataset, please download the video clip from https://www.youtube.com/watch?v=bSMxl1V8FSg (the mp4 at 480p resolution) and burst the first 30 seconds into frames (you should get about 719-720 frames).
Hint 1: https://github.com/ytdl-org/youtube-dl is a great tool to download YouTube videos. Use the -F flag to identify which format to download.
Hint 2: ffmpeg is a wonderful tool to burst the video into frames. But you may also use decord or other libraries for video manipulation (be wary of different frame rates!).
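
The frame-rate warning in Hint 2 is one plausible reason the expected count is given as a range. As a rough illustration, assuming the clip is encoded at either 24000/1001 fps (about 23.976) or exactly 24 fps, the arithmetic looks like this. Both rates are assumptions, the real rate of the downloaded file should be checked, and `approx_frame_count` is a hypothetical helper:

```python
from fractions import Fraction


def approx_frame_count(duration_s, fps):
    """Approximate number of frames in duration_s seconds at the given frame rate."""
    return Fraction(duration_s) * Fraction(fps)


# Two candidate frame rates (assumptions, not measured from the clip)
candidates = [
    ("24000/1001 fps", Fraction(24000, 1001)),
    ("24 fps", Fraction(24)),
]

for name, fps in candidates:
    count = approx_frame_count(30, fps)
    print(f"{name}: about {float(count):.2f} frames in 30 s")
```

Forcing a different output rate during extraction resamples the video, so the same clip can yield a slightly different number of frames depending on the tool and its settings.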


In [None]:
file_path='Movieclip.mp4'


In [None]:
!ffmpeg

In [3]:
# Trim the first 30 seconds of the clip and burst them into frames

import cv2
import numpy as np
import matplotlib.pyplot as plt
import os
import subprocess

frames = []
video_file='data/Movieclip.mp4'

desired_frames = 720
total_duration = 30  # seconds
fps = desired_frames / total_duration
output_folder='data/frames'

# Create the output folder if it does not exist yet
os.makedirs(output_folder, exist_ok=True)

# Command to trim the first 30 seconds of the video
trim_command = [
    "ffmpeg",
    "-y",  # Overwrite trimmed_video.mp4 on re-runs instead of waiting at a prompt
    "-i", video_file,
    "-t", "00:00:30",  # Duration of the output clip
    "-c:v", "libx264",  # Video codec to be used
    "-c:a", "aac",  # Audio codec to be used (if you want to keep the audio)
    "trimmed_video.mp4"
]
subprocess.run(trim_command)

# Command to extract frames at the calculated fps
frame_extraction_command = [
    "ffmpeg",
    "-i", "trimmed_video.mp4",  # Input is the trimmed video
    "-vf", f"fps={fps}",
    os.path.join(output_folder, "frame_%04d.jpg")
]

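subprocess.run(frame_extraction_command)
# Count only the extracted frame images, ignoring any other files in the folder
num_frames = len([f for f in os.listdir(output_folder) if f.endswith(".jpg")])
print("Number of frames:", num_frames)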





2. [1.5 points] Face detection. Use the Viola-Jones Haar cascades based face detector from OpenCV to detect faces in each frame. How long does it take to process each frame? Identify some key factors of the algorithm that could change the time.
Hint: you may need to look within the xml config file.
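
One way to reason about the timing question is to count how many window sizes the detector has to try. The sketch below is a rough model, not OpenCV's actual implementation. It assumes a 24x24 base detection window (to be confirmed from the `<width>` and `<height>` entries in the xml file) and an 854x480 frame, and `pyramid_scales` is a hypothetical helper:

```python
def pyramid_scales(image_size, scale_factor=1.1, min_size=(30, 30), base_window=(24, 24)):
    """List the scale multipliers a sliding-window detector would visit.

    Rough model: the window grows by scale_factor each step, windows smaller
    than min_size are skipped, and the search stops once the window no longer
    fits inside the image.
    """
    img_w, img_h = image_size
    scales = []
    scale = 1.0
    while True:
        win_w = base_window[0] * scale
        win_h = base_window[1] * scale
        if win_w > img_w or win_h > img_h:
            break
        if win_w >= min_size[0] and win_h >= min_size[1]:
            scales.append(scale)
        scale *= scale_factor
    return scales


frame_size = (854, 480)  # assumed 480p frame size
for sf in (1.05, 1.1, 1.3):
    n = len(pyramid_scales(frame_size, scale_factor=sf))
    print(f"scaleFactor={sf}: {n} scales to scan")
```

Under this model, a smaller `scaleFactor` or a smaller `minSize` means more scales to scan, and a larger frame means more window positions per scale. The number of stages and features per stage in the cascade file also bounds the work done per window.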

In [19]:
import cv2

cascade_path = cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
face_cascade = cv2.CascadeClassifier(cascade_path)

In [20]:
def detect_faces(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
    return faces


In [26]:
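import time

# NOTE: `frames` must already hold decoded images (e.g. loaded with cv2.imread).
# If the list is empty, the loop does no work and the reported time is meaningless.
start_time = time.perf_counter()
for frame in frames:
    faces = detect_faces(frame)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x+w, y+h), (255, 0, 0), 2)
end_time = time.perf_counter()
# Total time over all frames in `frames`, in milliseconds
print("Processing time: {:.2f} ms".format((end_time - start_time) * 1000))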





Processing time: 0.11 ms


In [None]:
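frames_folder='data/frames'
# cv2.imwrite fails silently if the target folder is missing, so create it first
os.makedirs('data/output', exist_ok=True)
tot_time=0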
for i,frame_file in enumerate(sorted(os.listdir(frames_folder))):
    frame_path = os.path.join(frames_folder, frame_file)
    frame = cv2.imread(frame_path)

    # Convert the frame to grayscale for face detection
    gray_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Perform face detection
    start_time = time.time()
    faces = face_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=15, minSize=(30, 30))
    end_time = time.time()

    # Draw rectangles around the detected faces
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)

    # Calculate and print the time taken to process each frame
    processing_time = end_time - start_time
    print(f"Time taken to process frame {frame_file}: {processing_time} seconds")

    # Write the processed frame to the output directory
    output_filename = f'data/output/frame_{i}.png'
    cv2.imwrite(output_filename, frame)

    tot_time += processing_time

print('Total time taken:', tot_time)

Time taken to process frame frame_0001.jpg: 0.4403364658355713 seconds
Time taken to process frame frame_0002.jpg: 0.6025142669677734 seconds
Time taken to process frame frame_0003.jpg: 0.7894551753997803 seconds
Time taken to process frame frame_0004.jpg: 0.7572314739227295 seconds
Time taken to process frame frame_0005.jpg: 0.7073864936828613 seconds
Time taken to process frame frame_0006.jpg: 0.433211088180542 seconds
Time taken to process frame frame_0007.jpg: 0.42881155014038086 seconds
Time taken to process frame frame_0008.jpg: 0.43480801582336426 seconds
Time taken to process frame frame_0009.jpg: 0.42884230613708496 seconds
Time taken to process frame frame_0010.jpg: 0.43456578254699707 seconds
Time taken to process frame frame_0011.jpg: 0.41727519035339355 seconds
Time taken to process frame frame_0012.jpg: 0.4105961322784424 seconds
Time taken to process frame frame_0013.jpg: 0.41927218437194824 seconds
Time taken to process frame frame_0014.jpg: 0.410489559173584 seconds

3. Face detection visualization. Visualize the face detections made over the frames of the first 30 seconds as a new video. Link to the video from your Google Drive. Watch the video and draw three conclusions about when the face detector works or fails. Why do you think this is the case?
Hint: You can use cv2.rectangle to draw boxes on the image and then save them back to disk. Then ffmpeg can be used again to stitch together the frames into a new video.
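
The stitching step in the hint can be sketched as a small command builder. This is a minimal sketch, assuming the annotated frames were saved as `data/output/frame_0.png`, `data/output/frame_1.png`, and so on. The helper name `build_stitch_command` and the output file name are hypothetical, and the input pattern must match however the frames were actually named:

```python
def build_stitch_command(frame_pattern, output_path, fps=24, start_number=0):
    """Build an ffmpeg command that stitches numbered images into an H.264 video."""
    return [
        "ffmpeg",
        "-y",                               # overwrite the output if it exists
        "-framerate", str(fps),             # input frame rate of the image sequence
        "-start_number", str(start_number), # index of the first image
        "-i", frame_pattern,                # printf-style pattern, e.g. frame_%d.png
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",              # widely supported pixel format
        output_path,
    ]


cmd = build_stitch_command("data/output/frame_%d.png", "detections.mp4")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

A zero-padded pattern such as `frame_%04d.jpg` only matches files written with the same padding, so a mismatch between the saved file names and the pattern is a common reason for ffmpeg finding no input.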

In [29]:
!ffmpeg -framerate 24 -i processed_frames/frame_%03d.jpg -c:v libx264 -profile:v high -crf 20 -pix_fmt yuv420p output_video.mp4



4. [1.5 points] Association-based tracking. Tracking can be used to associate face detections across time and understand that it is the same character appearing across multiple frames of the movie. We will explore a simple way to perform tracking.
(i) Generate face tracks by comparing face detections in two consecutive frames and associating them based on IoU scores. You may want to associate faces only when IoU > 0.5. Do consider what happens when there are multiple face detections in both frames. Start new tracks for faces not seen in the previous frame. End existing tracks when faces are not visible in the next frame. How many unique tracks did you create in the first 30 seconds?
(ii) Update the video visualization above to now include a unique track identifier (an integer number is fine), shown inside each box. Link to the video from your Google Drive.
Hint: You may use cv2.putText to write these numbers. Make sure they are readable after stitching together the frames into a video.
(iii) Comment about the quality of the face tracks. Do different people get associated in one track? Is a unique character associated with one unique track id? Note the timestamps of some failure cases and explain why.
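
The association step in part (i) can be sketched as follows. This is one possible approach, not a reference solution: it uses greedy matching on IoU (highest-scoring pairs first) rather than an optimal assignment, boxes are assumed to be `(x, y, w, h)` tuples as returned by `detectMultiScale`, and the names `iou` and `associate` are hypothetical:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def associate(prev_tracks, detections, next_id, threshold=0.5):
    """Match detections in the current frame to tracks from the previous frame.

    prev_tracks: dict mapping track id -> box in the previous frame.
    Returns (dict of track id -> box for the current frame, next unused id).
    Tracks with no match are dropped (ended); unmatched detections start new tracks.
    """
    # Score every (track, detection) pair and visit the best overlaps first
    pairs = sorted(
        ((iou(pbox, dbox), tid, di)
         for tid, pbox in prev_tracks.items()
         for di, dbox in enumerate(detections)),
        reverse=True,
    )
    current, used_tracks, used_dets = {}, set(), set()
    for score, tid, di in pairs:
        if score <= threshold:
            break  # remaining pairs overlap too little to associate
        if tid in used_tracks or di in used_dets:
            continue  # each track and each detection is matched at most once
        current[tid] = detections[di]
        used_tracks.add(tid)
        used_dets.add(di)
    # Any detection left over starts a new track
    for di, dbox in enumerate(detections):
        if di not in used_dets:
            current[next_id] = dbox
            next_id += 1
    return current, next_id


# Toy example with hand-made boxes (not real detections)
tracks, next_id = associate({}, [(10, 10, 50, 50)], next_id=0)
tracks, next_id = associate(tracks, [(12, 11, 50, 50), (200, 200, 40, 40)], next_id)
print(tracks, "unique tracks so far:", next_id)
```

Because a track ends as soon as one frame has no matching detection, a single missed detection splits one character into several track ids, which is worth keeping in mind when answering part (iii).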