In [12]:
# Run this to cleanup environment if you need to rerun the code!
import os
os.system("rm *.mp4")
os.system("rm *.txt")
os.system("rm *.csv")
os.system("rm -rf result*")

0

# Drone follow me using Kalman Filters

Multi-Object Tracking (MOT) is a core visual ability that humans possess to perform kinetic tasks and coordinate other tasks. The AI community has recognized the importance of MOT via a series of [competitions](https://motchallenge.net).

In this assignment, the object classes are `bicycle` and `car`, and the ability to track these objects will be demonstrated using [Kalman Filters](https://en.wikipedia.org/wiki/Kalman_filter).


## Task 1: Setup your development environment and store the test video locally (10 points)

Your environment must be Docker-based, and you can use any TF2- or PT2-based Docker container compatible with your environment. You can also use Colab.

In [4]:
from pytube import YouTube
YouTube('https://www.youtube.com/watch?v=WeF4wpw7w9k').streams.first().download()
YouTube('https://www.youtube.com/watch?v=2NFwY15tRtA').streams.first().download()
YouTube('https://www.youtube.com/watch?v=5dRramZVu2Q').streams.first().download()

'/home/jovyan/CS370-Assignments/Drone Assignment/Drone Tracking Video.mp4'

In [5]:
input_file1 = 'Cyclist and vehicle Tracking - 1.mp4'
input_file2 = 'Cyclist and vehicle Tracking - 2.mp4'
input_file3 = 'Drone Tracking Video.mp4'

## Task 2: Object Detection (40 points)

Perform object detection on the following videos. 

```{eval-rst}
.. youtube:: https://www.youtube.com/watch?v=WeF4wpw7w9k
```

```{eval-rst}
.. youtube:: https://www.youtube.com/watch?v=2NFwY15tRtA
```

```{eval-rst}
.. youtube:: https://www.youtube.com/watch?v=5dRramZVu2Q
```
Split the videos into frames and use an object detector of your choice, in a framework of your choice, to detect the cyclists.

In [6]:
import pandas as pd
import numpy as np
import cv2
import matplotlib.pyplot as plt
import os

from glob import glob

import IPython.display as ipd
from tqdm.notebook import tqdm

In [7]:
# Load in video capture
# Pass the filename variables defined above, not string literals
cap1 = cv2.VideoCapture(input_file1)
cap2 = cv2.VideoCapture(input_file2)
cap3 = cv2.VideoCapture(input_file3)

In [11]:
from ultralytics import YOLO
import csv

# Directory for the annotated frames
os.makedirs("./result1", exist_ok=True)

# Load the model (yolov8n.pt is a lighter alternative to yolov8x.pt on slower machines)
model = YOLO('yolov8x.pt')

cap = cv2.VideoCapture(input_file1)
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

# Get video fps, used to convert frame numbers into timestamps in seconds
fps = cap.get(cv2.CAP_PROP_FPS)

# Open CSV file for writing; the with-block closes it even if the loop is interrupted
csv_filename = "output.csv"
with open(csv_filename, "w", newline="") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["vidId", "frameNum", "timestamp", "detectedObjId", "detectedObjClass", "confidence", "bbox_width", "bbox_height"])

    for frame in range(n_frames):
        ret, img = cap.read()
        if not ret:
            break

        # Use device="mps" on a MacBook and device="0" on a machine with a CUDA GPU
        results = model.predict(source=img, save=True, device="mps", conf=0.4)

        # Timestamp derived from the real frame rate instead of assuming 24 fps
        timestamp = frame / fps if fps else 0.0

        for result in results:
            result.save(filename=f'./result1/result{frame}.jpg')  # annotated frame to disk
            result.save_txt('new.txt', save_conf=True)  # append labels with confidences

            # One CSV row per detected box; the id is the box index within the frame
            for obj_id, box in enumerate(result.boxes):
                cls_name = result.names[int(box.cls[0])]
                _, _, width, height = box.xywh[0].tolist()
                csv_writer.writerow([input_file1, frame, f'{timestamp:.3f}', obj_id,
                                     cls_name, float(box.conf[0]), width, height])

cap.release()


0: 352x640 (no detections), 1032.6ms
Speed: 6.4ms preprocess, 1032.6ms inference, 12.5ms postprocess per image at shape (1, 3, 352, 640)
Results saved to runs/detect/predict

0: 352x640 (no detections), 1074.9ms
Speed: 2.4ms preprocess, 1074.9ms inference, 0.2ms postprocess per image at shape (1, 3, 352, 640)
Results saved to runs/detect/predict

0: 352x640 (no detections), 955.5ms
Speed: 4.6ms preprocess, 955.5ms inference, 1.6ms postprocess per image at shape (1, 3, 352, 640)
Results saved to runs/detect/predict

0: 352x640 (no detections), 1064.8ms
Speed: 0.6ms preprocess, 1064.8ms inference, 0.3ms postprocess per image at shape (1, 3, 352, 640)
Results saved to runs/detect/predict

0: 352x640 (no detections), 913.7ms
Speed: 0.4ms preprocess, 913.7ms inference, 0.2ms postprocess per image at shape (1, 3, 352, 640)
Results saved to runs/detect/predict

0: 352x640 (no detections), 679.4ms
Speed: 1.0ms preprocess, 679.4ms inference, 0.2ms postpr

KeyboardInterrupt: 

## Task 3: Kalman Filter (50 points)

Use the [filterpy](https://filterpy.readthedocs.io/en/latest/kalman/KalmanFilter.html) library to implement Kalman filters that track the cyclist and the vehicle (if present) in the video. Use the detections from the previous task to initialize and run each Kalman filter.
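
As a rough illustration of the predict/update cycle such a filter runs, here is a dependency-free sketch of a constant-velocity Kalman filter for a single image axis. It is not the filterpy implementation: the class name `ConstantVelocityFilter1D`, the noise values, and the synthetic detection centers are all assumptions made for this example. filterpy's `KalmanFilter` holds the same quantities as the matrices `F`, `H`, `P`, `Q`, and `R` and exposes `predict()` and `update()` methods.

```python
class ConstantVelocityFilter1D:
    """Kalman filter for one image axis with state [position, velocity]."""

    def __init__(self, pos, dt=1.0, meas_var=1.0, accel_var=0.01, init_var=100.0):
        self.p, self.v = float(pos), 0.0
        # Symmetric covariance matrix stored as [[a, b], [b, c]]
        self.a, self.b, self.c = init_var, 0.0, init_var
        self.dt, self.r, self.q = dt, meas_var, accel_var

    def predict(self):
        """Propagate the state one frame ahead: x = F x, P = F P F^T + Q."""
        dt = self.dt
        self.p += self.v * dt
        a = self.a + 2 * dt * self.b + dt * dt * self.c + self.q * dt ** 4 / 4
        b = self.b + dt * self.c + self.q * dt ** 3 / 2
        c = self.c + self.q * dt ** 2
        self.a, self.b, self.c = a, b, c
        return self.p

    def update(self, z):
        """Correct the state with a measured position z (H = [1, 0])."""
        y = z - self.p                   # innovation
        s = self.a + self.r              # innovation variance
        k0, k1 = self.a / s, self.b / s  # Kalman gain
        self.p += k0 * y
        self.v += k1 * y
        # P = (I - K H) P, computed from the pre-update entries
        a = (1 - k0) * self.a
        b = (1 - k0) * self.b
        c = self.c - k1 * self.b
        self.a, self.b, self.c = a, b, c
        return self.p


# Track a box center with one filter per axis (synthetic, noise-free detections)
fx, fy = ConstantVelocityFilter1D(0.0), ConstantVelocityFilter1D(50.0)
trajectory = []
for k in range(1, 21):
    cx, cy = 2.0 * k, 50.0 - 1.0 * k  # hypothetical detection centers
    fx.predict()
    fy.predict()
    trajectory.append((fx.update(cx), fy.update(cy)))
```

Running two independent one-axis filters keeps the arithmetic readable; a single four-state filter over `[x, y, vx, vy]` is the more common arrangement when a matrix library is available.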

You need to deliver a video that shows the trajectory of each object as a line connecting the pixel positions reported by the tracker. You can use the `ffmpeg` command line tool and OpenCV to superimpose the bounding box of each tracked object on the video as well as plot its trajectory.

Suggest methods that you can use to address false positives, and explain how the tracker can help you in this regard.
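
Two ideas that are often combined for this purpose are gating (ignore detections far from the position the tracker predicted) and track confirmation (trust a track only after several consecutive accepted detections). The sketch below shows one possible arrangement; the names `TrackStatus` and `accept_detection` and every threshold are hypothetical choices, not values prescribed by the assignment.

```python
import math
from dataclasses import dataclass


@dataclass
class TrackStatus:
    hits: int = 0            # consecutive frames with an accepted detection
    misses: int = 0          # consecutive frames without one
    confirmed: bool = False  # True once the track has enough consecutive hits


def accept_detection(status, predicted, detections,
                     gate_px=50.0, min_hits=3, max_misses=5):
    """Return the detection center nearest to `predicted` within the gate, else None."""
    near = [d for d in detections if math.dist(predicted, d) <= gate_px]
    if near:
        status.hits += 1
        status.misses = 0
        status.confirmed = status.confirmed or status.hits >= min_hits
        return min(near, key=lambda d: math.dist(predicted, d))
    # Nothing plausible this frame: the filter would only predict
    status.hits = 0
    status.misses += 1
    if status.misses > max_misses:
        status.confirmed = False
    return None


# Usage: the far-away detection at (400, 400) is rejected as a likely false positive
status = TrackStatus()
chosen = accept_detection(status, (100.0, 100.0), [(400.0, 400.0), (104.0, 97.0)])
```

A fixed pixel gate is the simplest option; a gate scaled by the filter's innovation covariance is a common refinement.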

You will need to have one Kalman filter to track each of the required and present objects (cyclist and vehicle).
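
With one filter per class, each frame's detections have to be routed to the right filter. A minimal sketch of that routing, assuming each detection is a dictionary with hypothetical keys `cls`, `conf`, and `center`, could look like this:

```python
def best_detection_per_class(detections, wanted=("bicycle", "car")):
    """Keep the highest-confidence detection for each wanted class."""
    best = {}
    for det in detections:
        name = det["cls"]
        if name in wanted and (name not in best or det["conf"] > best[name]["conf"]):
            best[name] = det
    return best


# Two synthetic frames of detections (values invented for illustration)
frames = [
    [{"cls": "bicycle", "conf": 0.81, "center": (120.0, 80.0)},
     {"cls": "car", "conf": 0.66, "center": (300.0, 150.0)},
     {"cls": "bicycle", "conf": 0.45, "center": (400.0, 20.0)}],
    [{"cls": "car", "conf": 0.70, "center": (305.0, 151.0)},
     {"cls": "person", "conf": 0.90, "center": (50.0, 50.0)}],
]

# One measurement stream per class; each stream would feed its own Kalman filter
measurements = {"bicycle": [], "car": []}
for detections in frames:
    best = best_detection_per_class(detections)
    for name in measurements:
        # None marks a frame with no detection, where the filter would only predict
        measurements[name].append(best[name]["center"] if name in best else None)
```

Picking the highest-confidence box per class assumes at most one cyclist and one vehicle of interest per frame, which matches the task statement but would not hold in a crowded scene.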

## Extra Bonus (20 points)

```{eval-rst}
.. youtube:: https://www.youtube.com/watch?v=2hQx48U1L-Y
```

The cyclist in the video goes in and out of occlusions. In addition, the object is small, making detection fairly problematic without fine-tuning and other optimizations. Fine-tuning involves taking the pretrained model and training it further on images of cyclists from a training dataset such as [VisDrone](https://github.com/VisDrone/VisDrone-Dataset). At the same time, reducing the number of classes to a much smaller set, such as person and bicycle, may help. Some two-stage detectors may also need their parameters tuned further for small objects. See [this paper](https://www.mdpi.com/1424-8220/23/15/6887) for ideas around small object tracking.

```{note}
The extra points can only be awarded in the category of `assignments` and cannot be used to compensate for any other category such as `exams`. 
```