**Assignment 3**

**By Roll Number 2023702013**



Q1: Face detection and association-based tracking [4.5 points]

1. Data preparation. We will implement face detection and tracking on a famous scene from the movie Forrest Gump. To prepare the dataset, please download the video clip from https://www.youtube.com/watch?v=bSMxl1V8FSg (the mp4 at 480p resolution) and burst the first 30 seconds into frames (you should get about 719-720 frames).
Hint 1: https://github.com/ytdl-org/youtube-dl is a great tool to download Youtube videos. Use -F flag to identify which format to download.
Hint 2: ffmpeg is a wonderful tool to burst the video into frames. But you may also use decord or other libraries for video manipulation (be wary of different frame rates!).
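
Both hints can be combined into a single ffmpeg pass that trims and resamples at once. The sketch below only builds the command. The helper name `build_burst_command` and its default arguments are illustrative assumptions, while `-t`, `-vf fps=...` and the `%04d` output pattern are standard ffmpeg options.

```python
import os
import subprocess


def build_burst_command(video_file, output_folder, duration_s=30, fps=24):
    """Build an ffmpeg command that writes the first `duration_s` seconds as JPEG frames.

    Hypothetical helper: forcing a fixed `fps` makes the expected frame count
    (about duration_s * fps) independent of the source frame rate.
    """
    pattern = os.path.join(output_folder, "frame_%04d.jpg")
    return [
        "ffmpeg", "-y",          # overwrite existing outputs without prompting
        "-i", video_file,
        "-t", str(duration_s),   # keep only the first `duration_s` seconds
        "-vf", f"fps={fps}",     # resample to a fixed frame rate
        pattern,
    ]


if __name__ == "__main__":
    cmd = build_burst_command("data/Movieclip.mp4", "data/frames")
    print(" ".join(cmd))
    # To actually run it (requires ffmpeg on PATH and the input file):
    # os.makedirs("data/frames", exist_ok=True)
    # subprocess.run(cmd, check=True)
```

Fixing the output rate is what guards against the "different frame rates" caveat: the frame count then depends only on the duration and the chosen rate.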


In [None]:
!ffmpeg

In [1]:
#import the frames

import cv2
import numpy as np
import matplotlib.pyplot as plt
import os
import subprocess

frames = []
video_file='data/Movieclip.mp4'

desired_frames = 720
total_duration = 30  
fps = desired_frames / total_duration
output_folder='data/frames'


os.makedirs(output_folder, exist_ok=True)

# Command to trim the first 30 seconds of the video
trim_command = [
    "ffmpeg",
    "-i", video_file,
    "-t", "00:00:30", 
    "-c:v", "libx264", 
    "-c:a", "aac",  
    "data/trimmed_video.mp4"
]
subprocess.run(trim_command)


frame_extraction_command = [
    "ffmpeg",
    "-i", "data/trimmed_video.mp4", 
    "-vf", f"fps={fps}",
    os.path.join(output_folder, "frame_%04d.jpg")
]

subprocess.run(frame_extraction_command)
num_frames = len(os.listdir(output_folder))
print("Number of frames:", num_frames)





Number of frames: 721


frame=  721 fps=432 q=24.8 Lsize=N/A time=00:00:30.04 bitrate=N/A speed=  18x    
video:17790kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown


2. [1.5 points] Face detection. Use the Viola-Jones Haar cascades based face detector from OpenCV to detect faces in each frame. How long does it take to process each frame? Identify some key factors of the algorithm that could change the time.
Hint: you may need to look within the xml config file.
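
Following the hint, one way to look inside the config file is to count the cascade stages and the weak classifiers per stage, since these set how much work each window can cost. The sketch below is a minimal illustration that assumes the newer OpenCV cascade layout (`cascade`, `stages`, `_`, `maxWeakCount`, `width`, `height` tags). `summarize_cascade` is a hypothetical helper, and the toy XML contains made-up values, not those of `haarcascade_frontalface_default.xml`.

```python
import xml.etree.ElementTree as ET


def summarize_cascade(xml_text):
    """Return the base window size, stage count and weak classifiers per stage.

    Assumes the layout <opencv_storage><cascade>...<stages><_>...</_></stages>.
    """
    root = ET.fromstring(xml_text)
    cascade = root.find("cascade")
    stages = cascade.find("stages")
    weak_per_stage = [
        int(stage.findtext("maxWeakCount"))
        for stage in stages.findall("_")   # each stage is stored under a "_" tag
    ]
    return {
        "window": (int(cascade.findtext("width")), int(cascade.findtext("height"))),
        "num_stages": len(weak_per_stage),
        "weak_per_stage": weak_per_stage,
    }


# Toy document with invented numbers, only to show the expected structure
TOY_XML = """<opencv_storage>
<cascade>
  <height>24</height>
  <width>24</width>
  <stages>
    <_><maxWeakCount>3</maxWeakCount></_>
    <_><maxWeakCount>5</maxWeakCount></_>
  </stages>
</cascade>
</opencv_storage>"""

if __name__ == "__main__":
    print(summarize_cascade(TOY_XML))
```

On the real file, the text could be read with `open(cascade_path).read()` and passed to the same function, provided the file follows the layout assumed here.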

In [2]:
import cv2

cascade_path = 'haarcascade_frontalface_default.xml'
face_cascade = cv2.CascadeClassifier(cascade_path)

In [3]:
def detect_faces(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
    return faces


In [4]:
import time

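# NOTE: `frames` is still the empty list created in the first cell, so this loop
# never calls detect_faces and the time printed below is that of an empty loop.
# Per-frame timing on the extracted frames is measured in the next cell.
start_time = time.time()
for frame in frames: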
    faces = detect_faces(frame)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x+w, y+h), (255, 0, 0), 2)
end_time = time.time()
print("Processing time: {:.2f} ms".format((end_time - start_time) * 1000))





Processing time: 0.05 ms


In [5]:
from pathlib import Path

frames_folder='data/frames'
tot_time=0
for i,frame_file in enumerate(sorted(os.listdir(frames_folder))):
    frame_path = os.path.join(frames_folder, frame_file)
    frame = cv2.imread(frame_path)

    # Convert the frame to grayscale for face detection
    gray_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Perform face detection
    start_time = time.time()
    faces = face_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=15, minSize=(30, 30))
    end_time = time.time()

    # Draw rectangles around the detected faces
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)

    # Calculate and print the time taken to process each frame
    processing_time = end_time - start_time
    print(f"Time taken to process frame {frame_file}: {processing_time} seconds")

    detected_frames_dir = Path('data/detected_frames')
    if not os.path.exists(detected_frames_dir):
        os.makedirs(detected_frames_dir)

    output_filename = detected_frames_dir / f'frame_{i:04d}.jpg'
    cv2.imwrite(str(output_filename), frame)

    tot_time += processing_time

print('Total time taken:', tot_time)

Time taken to process frame frame_0001.jpg: 0.03980565071105957 seconds
Time taken to process frame frame_0002.jpg: 0.030321598052978516 seconds
Time taken to process frame frame_0003.jpg: 0.033564090728759766 seconds
Time taken to process frame frame_0004.jpg: 0.033434152603149414 seconds
Time taken to process frame frame_0005.jpg: 0.030947208404541016 seconds
Time taken to process frame frame_0006.jpg: 0.034698486328125 seconds
Time taken to process frame frame_0007.jpg: 0.035607099533081055 seconds
Time taken to process frame frame_0008.jpg: 0.036280155181884766 seconds
Time taken to process frame frame_0009.jpg: 0.03262901306152344 seconds
Time taken to process frame frame_0010.jpg: 0.03611946105957031 seconds
Time taken to process frame frame_0011.jpg: 0.032827138900756836 seconds
Time taken to process frame frame_0012.jpg: 0.03375577926635742 seconds
Time taken to process frame frame_0013.jpg: 0.03561544418334961 seconds
Time taken to process frame frame_0014.jpg: 0.0367581844329
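
Per-frame prints like the ones above are easier to interpret once reduced to a few summary statistics. The sketch below is one possible way to do that. `time_per_item` and `summarize` are hypothetical helpers, and the workload in the demo is a stand-in for the `detectMultiScale` call, not a measurement from this notebook.

```python
import statistics
import time


def time_per_item(fn, items):
    """Call fn(item) for each item and return the per-call durations in seconds."""
    durations = []
    for item in items:
        start = time.perf_counter()   # monotonic, high-resolution clock
        fn(item)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Mean, median and maximum of the durations, in milliseconds."""
    ms = [d * 1000.0 for d in durations]
    return {
        "mean_ms": statistics.mean(ms),
        "median_ms": statistics.median(ms),
        "max_ms": max(ms),
    }


if __name__ == "__main__":
    # Stand-in workload; in the notebook `fn` would wrap the detection call
    durations = time_per_item(lambda n: sum(range(n)), [10_000] * 20)
    print(summarize(durations))
```

The median is less sensitive than the mean to the occasional slow frame, such as the first call, which often includes one-time setup cost.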

3. Face detection visualization. Visualize the face detections made over the first 30s frames as a new video. Link to the video from your google drive. Watch the video and draw three conclusions about when does the face detector work or fail. Why do you think this is the case?
Hint: You can use cv2.rectangle to draw boxes on the image and then save them back to disk. Then ffmpeg can be used again to stitch together the frames into a new video.

In [19]:
# make results directory
results_dir = "results"
if not os.path.exists(results_dir):
    os.makedirs(results_dir)

# Converting the detected frames to a video
!ffmpeg -framerate 24 -i data/detected_frames/frame_%04d.jpg -c:v libx264 -profile:v high -crf 20 -pix_fmt yuv420p "results"/proc_vid.mp4



4. [1.5 point] Association-based tracking. Tracking can be used to associate face detections across time and understand that it is the same character appearing across multiple frames of the movie. We will explore a simple way to perform tracking.
(i) Generate face tracks by comparing face detections in two consecutive frames and associating them based on IoU scores. You may want to associate faces only when IoU > 0.5. Do consider what happens when there are multiple face detections in both frames. Start new tracks for faces not seen in the previous frame. End existing tracks when faces are not visible in the next frame. How many unique tracks did you create in the first 30 seconds?
(ii) Update the video visualization above to now include a unique track identifier (an integer number is fine), shown inside each box. Link to the video from your google drive.
Hint: You may use cv2.putText to write these numbers. Make sure they are readable after stitching together the frames into a video.
(iii) Comment about the quality of the face tracks. Do different people get associated in one track? Is a unique character associated with one unique track id? Note the timestamps of some failure cases and explain why.
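
When both frames contain several detections, each detection should be used in at most one association. The sketch below shows one simple way to enforce that: sort all candidate pairs by IoU and accept them greedily. The names `iou` and `match_boxes` are illustrative, boxes are assumed to be `(x, y, w, h)` tuples as returned by `detectMultiScale`, and this is a sketch of one option rather than the only valid strategy.

```python
def iou(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    inter = max(0, iw) * max(0, ih)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def match_boxes(prev_boxes, curr_boxes, threshold=0.5):
    """Greedy one-to-one matching, best IoU pairs first.

    Returns (matches, unmatched_prev, unmatched_curr), where matches is a
    sorted list of (prev_index, curr_index) pairs with IoU > threshold.
    """
    pairs = sorted(
        ((iou(p, c), i, j)
         for i, p in enumerate(prev_boxes)
         for j, c in enumerate(curr_boxes)),
        reverse=True,
    )
    used_prev, used_curr, matches = set(), set(), []
    for score, i, j in pairs:
        if score <= threshold:
            break                      # remaining pairs overlap even less
        if i in used_prev or j in used_curr:
            continue                   # each box takes part in one match only
        used_prev.add(i)
        used_curr.add(j)
        matches.append((i, j))
    unmatched_prev = [i for i in range(len(prev_boxes)) if i not in used_prev]
    unmatched_curr = [j for j in range(len(curr_boxes)) if j not in used_curr]
    return sorted(matches), unmatched_prev, unmatched_curr
```

Unmatched previous boxes correspond to tracks that end, and unmatched current boxes to tracks that start.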

In [22]:
import cv2

cascade_path = 'haarcascade_frontalface_default.xml'
face_cascade = cv2.CascadeClassifier(cascade_path)

# Path to the folder containing frames
frames_folder = "data/frames"

# Initialize variables for tracking
prev_faces = None
face_tracks = []

# Process each frame
for frame_file in sorted(os.listdir(frames_folder)):
    frame_path = os.path.join(frames_folder, frame_file)
    frame = cv2.imread(frame_path)
    frame_copy = frame.copy()

    # Convert the frame to grayscale for face detection
    gray_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Perform face detection
    faces = face_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))

    # Associate faces with previous frame based on IoU
    if prev_faces is not None and len(prev_faces) > 0 and len(faces) > 0:
        for prev_face_id, prev_face in enumerate(prev_faces):
            for curr_face_id, curr_face in enumerate(faces):
                # Calculate IoU and associate faces if IoU > 0.5
                x1_prev, y1_prev, w_prev, h_prev = prev_face
                x1_curr, y1_curr, w_curr, h_curr = curr_face
                x2_prev, y2_prev = x1_prev + w_prev, y1_prev + h_prev
                x2_curr, y2_curr = x1_curr + w_curr, y1_curr + h_curr

                x_left = max(x1_prev, x1_curr)
                y_top = max(y1_prev, y1_curr)
                x_right = min(x2_prev, x2_curr)
                y_bottom = min(y2_prev, y2_curr)

                if x_right > x_left and y_bottom > y_top:
                    intersection_area = (x_right - x_left) * (y_bottom - y_top)
                    prev_area = w_prev * h_prev
                    curr_area = w_curr * h_curr
                    iou = intersection_area / float(prev_area + curr_area - intersection_area)

                    if iou > 0.5:
                        face_tracks.append((prev_face_id, curr_face_id))

    # Debugging: print the current detections (they become prev_faces for the next frame)
    print("prev_faces:", faces)
    print("face_tracks:", face_tracks)

    # Visualize associations by joining the centre of each previous box to its matched current box.
    # prev_faces must still hold the previous frame here; updating it first would draw zero-length lines.
    for track in face_tracks:
        prev_face_id, curr_face_id = track
        if prev_face_id < len(prev_faces) and curr_face_id < len(faces):  # Ensure indices are within range
            x_prev, y_prev, w_prev, h_prev = prev_faces[prev_face_id]
            x_curr, y_curr, w_curr, h_curr = faces[curr_face_id]
            cv2.line(frame_copy, (x_prev + w_prev // 2, y_prev + h_prev // 2), (x_curr + w_curr // 2, y_curr + h_curr // 2), (0, 255, 0), 2)

    # Update previous faces for next frame
    prev_faces = faces

    # Display the frame with face tracks
    cv2.imshow('Face Tracks', frame_copy)
    cv2.waitKey(100)  # Adjust waitKey value for desired playback speed

# Release resources
cv2.destroyAllWindows()

prev_faces: [[325 322 192 192]]
face_tracks: []




prev_faces: [[321 320 192 192]]
face_tracks: [(0, 0)]
prev_faces: [[324 320 194 194]]
face_tracks: [(0, 0), (0, 0)]
prev_faces: [[330 322 192 192]]
face_tracks: [(0, 0), (0, 0), (0, 0)]
prev_faces: ()
face_tracks: [(0, 0), (0, 0), (0, 0)]
prev_faces: [[335 324 189 189]]
face_tracks: [(0, 0), (0, 0), (0, 0)]
prev_faces: ()
face_tracks: [(0, 0), (0, 0), (0, 0)]
prev_faces: [[761 243  94  94]]
face_tracks: [(0, 0), (0, 0), (0, 0)]
prev_faces: [[354 330 181 181]]
face_tracks: [(0, 0), (0, 0), (0, 0)]
prev_faces: ()
face_tracks: [(0, 0), (0, 0), (0, 0)]
prev_faces: ()
face_tracks: [(0, 0), (0, 0), (0, 0)]
prev_faces: ()
face_tracks: [(0, 0), (0, 0), (0, 0)]
prev_faces: [[802 268  66  66]
 [339 318 198 198]]
face_tracks: [(0, 0), (0, 0), (0, 0)]
prev_faces: ()
face_tracks: [(0, 0), (0, 0), (0, 0)]
prev_faces: [[338 324 184 184]]
face_tracks: [(0, 0), (0, 0), (0, 0)]
prev_faces: [[346 326 178 178]]
face_tracks: [(0, 0), (0, 0), (0, 0), (0, 0)]
prev_faces: [[327 313 199 199]]
face_tracks: [(0,

In [23]:
import numpy as np

cascade_path = 'haarcascade_frontalface_default.xml'
face_cascade = cv2.CascadeClassifier(cascade_path)

# Disable OpenCL to avoid the CL_MEM_OBJECT_ALLOCATION_FAILURE error
cv2.ocl.setUseOpenCL(False)

# Path to the folder containing frames
frames_folder = "data/frames"

# Define a function to calculate IoU (Intersection over Union)
def calculate_iou(bbox1, bbox2):
    x1, y1, w1, h1 = bbox1
    x2, y2, w2, h2 = bbox2

    # Calculate coordinates of intersection rectangle
    x_intersection = max(x1, x2)
    y_intersection = max(y1, y2)
    w_intersection = min(x1 + w1, x2 + w2) - x_intersection
    h_intersection = min(y1 + h1, y2 + h2) - y_intersection

    # Calculate area of intersection rectangle
    intersection_area = max(0, w_intersection) * max(0, h_intersection)

    # Calculate area of bounding boxes
    area_bbox1 = w1 * h1
    area_bbox2 = w2 * h2

    # Calculate IoU
    iou = intersection_area / float(area_bbox1 + area_bbox2 - intersection_area)

    return iou

# Initialize variables for tracking
tracks = []  # List to store active tracks
track_id_counter = 0  # Counter to assign unique IDs to tracks

# Process each frame
for frame_file in sorted(os.listdir(frames_folder)):
    frame_path = os.path.join(frames_folder, frame_file)
    frame = cv2.imread(frame_path)

    # Convert the frame to grayscale for face detection
    gray_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Perform face detection
    faces = face_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=15, minSize=(30, 30))

    # Update existing tracks
    for track in tracks:
        # Check if the track is active in the current frame
        if track['active']:
            # Calculate IoU with each detected face
            max_iou = 0
            best_match_index = -1
            for i, face_bbox in enumerate(faces):
                iou = calculate_iou(track['bbox'], face_bbox)
                if iou > max_iou:
                    max_iou = iou
                    best_match_index = i

            # If a matching face is found (IoU > 0.5), update the track
            if max_iou > 0.5:
                track['bbox'] = faces[best_match_index]
                track['frames'].append(frame_file)
                faces = np.delete(faces, best_match_index, axis=0)
            else:
                # End the track if the face is not detected in the current frame
                track['active'] = False

    # Start new tracks for detections that matched no existing track
    for new_face_bbox in faces:
        new_track = {
            'id': track_id_counter,
            'bbox': new_face_bbox,
            'frames': [frame_file],
            'active': True
        }
        tracks.append(new_track)
        track_id_counter += 1

# Count the unique tracks created
unique_tracks = set(track['id'] for track in tracks)
print("Number of unique tracks:", len(unique_tracks))


Number of unique tracks: 70


4(ii). Visualisation of face tracks

In [14]:
import cv2
import os
import numpy as np

# Load the pre-trained Haar cascade classifier for face detection
cascade_path = 'haarcascade_frontalface_default.xml'
face_cascade = cv2.CascadeClassifier(cascade_path)

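# Path to the folder containing frames
frames_folder = "data/frames"
output_folder = "data/output_frames"
# cv2.imwrite returns False without raising if the target folder is missing
os.makedirs(output_folder, exist_ok=True)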

# Define a function to calculate IoU (Intersection over Union)
def calculate_iou(bbox1, bbox2):
    x1, y1, w1, h1 = bbox1
    x2, y2, w2, h2 = bbox2

    # Calculate coordinates of intersection rectangle
    x_intersection = max(x1, x2)
    y_intersection = max(y1, y2)
    w_intersection = min(x1 + w1, x2 + w2) - x_intersection
    h_intersection = min(y1 + h1, y2 + h2) - y_intersection

    # Calculate area of intersection rectangle
    intersection_area = max(0, w_intersection) * max(0, h_intersection)

    # Calculate area of bounding boxes
    area_bbox1 = w1 * h1
    area_bbox2 = w2 * h2

    # Calculate IoU
    iou = intersection_area / float(area_bbox1 + area_bbox2 - intersection_area)

    return iou

# Initialize variables for tracking
tracks = []  # List to store active tracks
track_id_counter = 0  # Counter to assign unique IDs to tracks

# Process each frame
for frame_file in sorted(os.listdir(frames_folder)):
    frame_path = os.path.join(frames_folder, frame_file)
    frame = cv2.imread(frame_path)

    # Convert the frame to grayscale for face detection
    gray_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Perform face detection
    faces = face_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=15, minSize=(30, 30))

    # Update existing tracks
    for track in tracks:
        # Check if the track is active in the current frame
        if track['active']:
            # Calculate IoU with each detected face
            max_iou = 0
            best_match_index = -1
            for i, face_bbox in enumerate(faces):
                iou = calculate_iou(track['bbox'], face_bbox)
                if iou > max_iou:
                    max_iou = iou
                    best_match_index = i

            # If a matching face is found (IoU > 0.5), update the track
            if max_iou > 0.5:
                track['bbox'] = faces[best_match_index]
                track['frames'].append(frame_file)
                faces = np.delete(faces, best_match_index, axis=0)
            else:
                # End the track if the face is not detected in the current frame
                track['active'] = False

    # Start new tracks for detections that matched no existing track
    for new_face_bbox in faces:
        new_track = {
            'id': track_id_counter,
            'bbox': new_face_bbox,
            'frames': [frame_file],
            'active': True
        }
        tracks.append(new_track)
        track_id_counter += 1

    # Draw rectangles around faces and put track ID inside each box
    for track in tracks:
        if track['active'] and frame_file in track['frames']:
            x, y, w, h = track['bbox']
            cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
            
            
            font = cv2.FONT_HERSHEY_SIMPLEX 
            fontScale = 1
            color = (0, 255, 0) 
            thickness = 2
            
            cv2.putText(frame, str(track['id']), (x + 2, y - 10), font, fontScale, color, thickness, cv2.LINE_AA, False)

    # Save the frame with rectangles and track IDs
    output_path = os.path.join(output_folder, frame_file)
    cv2.imwrite(output_path, frame)


In [15]:
import subprocess

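# Use ffmpeg to stitch the frames into a new video
# NOTE: the frames were extracted at 24 fps, so -r 25 plays the clip slightly faster than the source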
output_video = "output_video_with_track_id.mp4"
ffmpeg_cmd = f"ffmpeg -y -r 25 -i {output_folder}/frame_%04d.jpg -vcodec libx264 -crf 25 -pix_fmt yuv420p results/{output_video}"

# running the command
subprocess.run(ffmpeg_cmd, shell=True)


CompletedProcess(args='ffmpeg -y -r 25 -i data/output_frames/frame_%04d.jpg -vcodec libx264 -crf 25 -pix_fmt yuv420p results/output_video_with_track_id.mp4', returncode=0)

### Algorithm Used - Viola-Jones Haar Cascades

* Load the pre-trained Haar cascade classifier for frontal face detection.
* Iterate through the frames in the folder in sorted order and process them one by one.
* Convert each frame to grayscale and use the detectMultiScale method of the cascade classifier to detect faces.
* Draw rectangles around the detected faces.
* Save the annotated frame to disk so the frames can be stitched into a video.


### Observations: Factors Affecting Processing Time

In the run above, detection took roughly 30 to 40 ms per frame for the frames shown.

| Factor               | Effect on Processing Time |
|----------------------|---------------------------|
| **Frame Size**       | Larger frames contain more sliding-window positions at every scale, so they take longer to process. |
| **Number of Faces**  | Face-like regions pass through more cascade stages before being accepted or rejected, so frames with more faces or face-like clutter take longer. |
| **Scale Factor**     | A smaller scaleFactor adds more scales to search and increases processing time; a larger scaleFactor is faster but can miss faces whose size falls between scales. |
| **Minimum Neighbors**| minNeighbors is applied when overlapping candidate windows are grouped after the scan, so it has little effect on processing time; raising it mainly reduces false positives at the cost of more missed faces. |
| **Minimum Size**     | A smaller minSize adds more small scales, which have the most window positions, and so increases processing time; a larger minSize is faster but ignores small faces. |
| **Haar Cascade XML File** | The XML file fixes the base window size, the number of stages, and the number of Haar features per stage, which determine the work done per window; different cascade files therefore differ in speed. |
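
The effect of scaleFactor and minSize can be made concrete by counting how many window sizes fit between the smallest and largest face size searched. The sketch below is a rough model only, not OpenCV's exact internal schedule. `count_scales` is a hypothetical helper, and the 720-pixel frame side in the demo is an assumed value.

```python
def count_scales(min_size, max_size, scale_factor):
    """Count window sizes min_size * scale_factor**k that do not exceed max_size."""
    if scale_factor <= 1.0:
        raise ValueError("scale_factor must be greater than 1")
    n, size = 0, float(min_size)
    while size <= max_size:
        n += 1
        size *= scale_factor   # next, larger window
    return n


if __name__ == "__main__":
    # Assumed frame side of 720 px, minSize of 30 px as in the cells above
    for sf in (1.05, 1.1, 1.3):
        print(sf, count_scales(30, 720, sf))
```

Each extra scale means another full sweep of windows over the frame, which is why a scaleFactor close to 1 is noticeably slower.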




### Observations during face tracking

* **Successful Detections:** In most frames the faces are detected correctly, apart from some false positives.

* **Missed Detections:** There are very few missed detections. They occur mainly when image quality is low or the face is occluded.

* **False Positives:** There were false positives. They arise when patterns or shapes in the image resemble a face; the detector marks parts of fabric or empty background as faces.

Parameters were tuned to reduce false positives: minNeighbors was raised from 5 in the first attempts to 15 in the final detection and tracking runs.


### Observations during association based tracking

We observe that the face detection is fairly accurate. However, the association of a person with a single face track is weak. A track ends as soon as no detection in the next frame overlaps it with IoU > 0.5, so when a person's expression changes significantly, or the detector misses a frame, a new face track is started. This produced 70 tracks over the 30 seconds. However, different people do not get associated in one track.
In most cases, a unique person is associated with a unique track ID.
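
To quantify how fragmented the tracks are, the `tracks` list built in the tracking cell can be summarized by track length. The sketch below assumes each track is a dict with a `'frames'` list, as in that cell. `track_length_summary` and the `short_threshold` default are illustrative choices, and the demo data is synthetic.

```python
def track_length_summary(tracks, short_threshold=5):
    """Summarize track lengths (in frames) for a list of track dicts."""
    lengths = [len(t["frames"]) for t in tracks]
    if not lengths:
        return {"num_tracks": 0, "longest": 0, "short_tracks": 0, "mean_length": 0.0}
    return {
        "num_tracks": len(lengths),
        "longest": max(lengths),
        # tracks shorter than the threshold are likely fragments or false positives
        "short_tracks": sum(1 for n in lengths if n < short_threshold),
        "mean_length": sum(lengths) / len(lengths),
    }


if __name__ == "__main__":
    # Synthetic example, not data from this notebook
    demo = [
        {"id": 0, "frames": ["frame_0001.jpg", "frame_0002.jpg"]},
        {"id": 1, "frames": [f"frame_{i:04d}.jpg" for i in range(3, 13)]},
    ]
    print(track_length_summary(demo))
```

A large share of very short tracks would indicate that IDs change because of missed detections rather than because new people appear.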

### Observations: Failure Cases in Tracking

| Time     | Description |
|----------|-------------|
| **0:03** | A new track starts for the girl instead of continuing the existing one. The IoU with her previous box fell below 0.5, possibly because of low video resolution or partial occlusion of the face. |
| **0:06** | The two kids throwing stones are not detected, likely because occlusion and poor video quality impede face detection. |
| **0:15** | A false positive occurs when the boy's shirt sleeve is mistakenly identified as a face. |
| **0:23 to 0:25** | The girl is assigned three different track IDs. Her expression changes significantly across frames and the video quality is poor, so her detections are not matched to the existing track and are treated as new faces. |
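
The 0.5 IoU threshold is stricter than it may look. For two equal squares of side s offset horizontally by d, the IoU is (s - d) / (s + d), which drops to 0.5 once d reaches s / 3. The sketch below checks this numerically and also evaluates the first two boxes printed in the debug output of the tracking cell. The function names are illustrative.

```python
def iou_xywh(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    inter = max(0, iw) * max(0, ih)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(side, shift):
    """IoU of a square with a copy of itself moved `shift` pixels to the right."""
    return iou_xywh((0, 0, side, side), (shift, 0, side, side))


if __name__ == "__main__":
    # Two consecutive detections taken from the debug output above
    print(iou_xywh((325, 322, 192, 192), (321, 320, 192, 192)))
    # A 192 px box shifted by one third of its side sits exactly at the threshold
    print(shifted_iou(192, 64))
```

Small jitter between consecutive detections keeps the IoU well above the threshold, but a change in box position or size of about a third of the face width is enough to break the association and start a new track.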
