# Capstone Project: Object Detection and Tracking Pipeline

This capstone project walks you through building an end-to-end object detection and tracking pipeline. It progresses from beginner to advanced level, pairing each concept with a code example and a short explanation.

## Overview

Object detection identifies and locates objects in images or videos, while tracking follows these objects across frames. This pipeline includes:
- Dataset preparation
- Model training/fine-tuning
- Inference on video
- Object tracking
- Evaluation and visualization

## Prerequisites

- Python 3.8+
- OpenCV
- NumPy
- Matplotlib
- For the advanced sections: PyTorch and torchvision (the examples below use PyTorch rather than TensorFlow)

Install requirements:
```bash
pip install opencv-python numpy matplotlib torch torchvision
```

## Beginner Level: Introduction to Object Detection and Tracking

### What is Object Detection?

Object detection is the task of identifying and locating objects in an image or video frame. It involves:
- Classification: What is the object?
- Localization: Where is the object?
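
A single detection can be represented as a class label (classification) plus a bounding box (localization). The sketch below is illustrative only: the `Detection` class, its field names, and `xyxy_to_xywh` are hypothetical and not part of OpenCV or any other library used in this notebook. It also converts between the two box conventions used by the later cells, `(x, y, w, h)` and `(x1, y1, x2, y2)`.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one detected object."""
    label: str          # classification: what the object is
    confidence: float   # detector score in [0, 1]
    box_xywh: tuple     # localization: (x, y, width, height) in pixels

    def to_xyxy(self):
        """Convert (x, y, w, h) to corner format (x1, y1, x2, y2)."""
        x, y, w, h = self.box_xywh
        return (x, y, x + w, y + h)


def xyxy_to_xywh(box):
    """Convert corner format (x1, y1, x2, y2) back to (x, y, w, h)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


face = Detection(label="face", confidence=0.92, box_xywh=(40, 30, 100, 120))
corners = face.to_xyxy()
```

Keeping the box format explicit matters because mixing the two conventions is an easy way to draw or score boxes incorrectly.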

### What is Object Tracking?

Object tracking follows the movement of objects across multiple frames in a video sequence.

### Simple Example: Face Detection using Haar Cascades

Let's start with a basic face detection example using OpenCV's Haar cascade classifier.

In [None]:
import cv2
import matplotlib.pyplot as plt

# Load the pre-trained Haar cascade for face detection
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

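# Load an image (cv2.imread returns None instead of raising if the path is wrong)
image = cv2.imread('datasets/sample_images/group.jpg')
if image is None:
    raise FileNotFoundError('Could not read datasets/sample_images/group.jpg')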

# Convert to grayscale for detection
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect faces
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))

# Draw rectangles around detected faces
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x+w, y+h), (255, 0, 0), 2)

# Display the result
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.title('Face Detection')
plt.show()

print(f"Detected {len(faces)} faces")

### Explanation

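- We load OpenCV's pre-trained Haar cascade classifier for frontal faces.
- We convert the image to grayscale, because Haar cascades operate on single-channel intensity images.
- `detectMultiScale` searches for faces at several scales: `scaleFactor=1.1` shrinks the image by about 10% per step, `minNeighbors=5` keeps only candidates supported by at least five neighboring candidate rectangles, and `minSize=(30, 30)` ignores smaller windows.
- We draw a rectangle around each detected face and convert the image from BGR to RGB before showing it with Matplotlib.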

### Basic Tracking: Mean Shift Tracking

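Next, we implement basic single-object tracking with the Mean Shift algorithm, which repeatedly moves a fixed-size window toward the densest region of a color-histogram back projection.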

In [None]:
import cv2
import numpy as np

# Load video
cap = cv2.VideoCapture('datasets/sample_videos/sample_clip.mp4')

# Read first frame
ret, frame = cap.read()
if not ret:
    cap.release()
    raise RuntimeError("Cannot read video")

# Select the ROI (region of interest). For simplicity we use a fixed ROI;
# in practice it would come from a detector or from manual selection.
r, h, c, w = 200, 100, 300, 100  # top row, height, left column, width
track_window = (c, r, w, h)  # OpenCV expects (x, y, width, height)

# Set up the ROI for tracking
roi = frame[r:r+h, c:c+w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, np.array((0., 60., 32.)), np.array((180., 255., 255.)))
roi_hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# Setup the termination criteria
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ret, frame = cap.read()
    if not ret:
        break  # end of video

    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    dst = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)

    # Apply mean shift to get the new window location
    _, track_window = cv2.meanShift(dst, track_window, term_crit)

    # Draw the tracked window on the frame
    x, y, w, h = track_window
    cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)

    cv2.imshow('Tracking', frame)

    # Press 'q' to stop early
    if cv2.waitKey(60) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

print("Basic tracking completed")
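
To make the idea behind mean shift concrete, the sketch below re-implements the core update in NumPy on a synthetic probability map. It is a simplified stand-in, not OpenCV's implementation: the function name `mean_shift_window`, the synthetic blob, and the stopping rule are all assumptions made for illustration.

```python
import numpy as np


def mean_shift_window(prob, window, max_iter=10, eps=1.0):
    """Move an (x, y, w, h) window toward the weighted centroid of `prob`."""
    x, y, w, h = window
    rows, cols = prob.shape
    for _ in range(max_iter):
        patch = prob[y:y + h, x:x + w]
        total = patch.sum()
        if total == 0:
            break  # nothing to follow inside the window
        ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
        # Weighted centroid of the patch, in patch coordinates
        cx = (xs * patch).sum() / total
        cy = (ys * patch).sum() / total
        # Offset of the centroid from the geometric center of the patch
        dx = int(round(cx - (patch.shape[1] - 1) / 2))
        dy = int(round(cy - (patch.shape[0] - 1) / 2))
        new_x = min(max(x + dx, 0), cols - w)
        new_y = min(max(y + dy, 0), rows - h)
        moved = abs(new_x - x) + abs(new_y - y)
        x, y = new_x, new_y
        if moved < eps:
            break  # converged
    return x, y, w, h


# Synthetic back projection: a bright 10x10 blob on an empty background
prob = np.zeros((100, 100), dtype=np.float32)
prob[60:70, 50:60] = 1.0

start = (40, 50, 20, 20)  # partially overlaps the blob
final_window = mean_shift_window(prob, start)
print("Window moved from", start, "to", final_window)
```

The window only moves if it initially overlaps some non-zero probability, which is also why the tracking cell above needs a sensible initial ROI.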

### Additional Example: Coin Detection

Let's detect coins in an image by blurring it, thresholding it, and extracting the external contours.

In [None]:
import cv2
import numpy as np
import matplotlib.pyplot as plt

# Load image
image = cv2.imread('datasets/sample_images/coins.png')
if image is None:
    raise FileNotFoundError('Could not read datasets/sample_images/coins.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Apply Gaussian blur
blurred = cv2.GaussianBlur(gray, (11, 11), 0)

# Threshold the image
ret, thresh = cv2.threshold(blurred, 60, 255, cv2.THRESH_BINARY)

# Find contours
contours, hierarchy = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Draw contours
image_copy = image.copy()
cv2.drawContours(image_copy, contours, -1, (0, 255, 0), 2)

plt.imshow(cv2.cvtColor(image_copy, cv2.COLOR_BGR2RGB))
plt.title('Coin Detection')
plt.show()

print(f"Detected {len(contours)} coins")
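
Counting every contour as a coin can overcount when noise or non-round objects survive the threshold. One possible refinement, sketched below, filters candidates by area and circularity (`4 * pi * area / perimeter**2`, which is 1.0 for a perfect circle). The helper names and both thresholds are assumptions for illustration; with OpenCV, the area and perimeter would come from `cv2.contourArea(c)` and `cv2.arcLength(c, True)`.

```python
import math


def circularity(area, perimeter):
    """Return 4*pi*A / P^2: 1.0 for a perfect circle, lower for other shapes."""
    if perimeter <= 0:
        return 0.0
    return 4.0 * math.pi * area / (perimeter ** 2)


def looks_like_coin(area, perimeter, min_area=100.0, min_circularity=0.8):
    """Hypothetical filter: large enough and round enough."""
    return area >= min_area and circularity(area, perimeter) >= min_circularity


# (area, perimeter) pairs standing in for measured contours
shapes = {
    "circle_r10": (math.pi * 10 ** 2, 2 * math.pi * 10),
    "square_10": (100.0, 40.0),
    "speck": (4.0, 8.0),
}

coin_like = [name for name, (a, p) in shapes.items() if looks_like_coin(a, p)]
print(coin_like)
```

Suitable thresholds depend on image resolution and coin size, so they would need tuning on the actual image.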

## Intermediate Level: Pre-trained Models and Advanced Tracking

### Using YOLO for Object Detection

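YOLO (You Only Look Once) is a popular real-time, single-stage detector: one forward pass predicts bounding boxes and class scores for the whole image. The cell below runs YOLOv3 through OpenCV's DNN module and expects `yolov3.weights`, `yolov3.cfg`, and `coco.names` in the working directory.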

In [None]:
import cv2
import numpy as np

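# Load YOLO (expects yolov3.weights and yolov3.cfg in the working directory)
net = cv2.dnn.readNet('yolov3.weights', 'yolov3.cfg')
# getUnconnectedOutLayersNames() avoids manual index handling, whose
# return shape differs between OpenCV versions
output_layers = net.getUnconnectedOutLayersNames()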

# Load COCO class names
with open('coco.names', 'r') as f:
    classes = [line.strip() for line in f.readlines()]

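# Load image
img = cv2.imread('datasets/sample_images/sample1.png')
if img is None:
    raise FileNotFoundError('Could not read datasets/sample_images/sample1.png')
height, width, channels = img.shape

# Build the input blob: scale pixels to [0, 1], resize to 416x416, swap BGR to RGB
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), (0, 0, 0), swapRB=True, crop=False)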
net.setInput(blob)
outs = net.forward(output_layers)

# Collect detections whose best class score exceeds the confidence threshold
class_ids = []
confidences = []
boxes = []
for out in outs:
    for detection in out:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.5:
            # Object detected
            center_x = int(detection[0] * width)
            center_y = int(detection[1] * height)
            w = int(detection[2] * width)
            h = int(detection[3] * height)

            # Rectangle coordinates
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)

            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)

# Apply non-max suppression
indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)

# Draw the boxes that survived non-max suppression
for i in np.array(indexes).flatten():
    x, y, w, h = boxes[i]
    label = str(classes[class_ids[i]])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(img, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

# Display
cv2.imshow('YOLO Detection', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

print("YOLO detection completed")
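
The non-max suppression step in the cell above removes duplicate boxes that cover the same object. The pure-Python sketch below shows the general greedy idea under stated assumptions: `iou_xywh` and `greedy_nms` are hypothetical helpers, and the code illustrates the concept rather than reproducing the exact behavior of `cv2.dnn.NMSBoxes`. Boxes use the same `[x, y, w, h]` format as the YOLO cell.

```python
def iou_xywh(a, b):
    """Intersection over union of two [x, y, w, h] boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, score_threshold=0.5, iou_threshold=0.4):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box."""
    # Indices of boxes above the score threshold, best score first
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        if all(iou_xywh(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [[10, 10, 100, 100], [12, 12, 100, 100], [200, 200, 50, 50], [300, 300, 40, 40]]
scores = [0.9, 0.8, 0.7, 0.3]
kept = greedy_nms(boxes, scores)
print("Kept box indices:", kept)
```

Here the second box is suppressed because it almost coincides with the first, and the last box is dropped for its low score.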

### SORT Tracking Algorithm

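SORT (Simple Online and Realtime Tracking) is an efficient multi-object tracker that predicts each track with a Kalman filter and matches the predictions to new detections by IoU using the Hungarian algorithm. The cell below does not implement SORT itself; to keep the example short, it uses OpenCV's built-in CSRT single-object tracker instead.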

In [None]:
# Note: a full SORT implementation needs extra dependencies (pip install filterpy scikit-image).
# For simplicity, this cell uses OpenCV's built-in CSRT single-object tracker instead.

import cv2

# Load video
cap = cv2.VideoCapture('datasets/sample_videos/sample_clip.mp4')

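# Create tracker (if cv2 has no TrackerCSRT_create, your build may lack the
# contrib tracking module; installing opencv-contrib-python is the usual fix)
tracker = cv2.TrackerCSRT_create()

# Read first frame
ret, frame = cap.read()
if not ret:
    cap.release()
    raise RuntimeError("Cannot read video")

# Select ROI interactively: drag a box, then press ENTER or SPACE
bbox = cv2.selectROI('Tracking', frame, False)
tracker.init(frame, bbox)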

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Update tracker
    success, bbox = tracker.update(frame)

    if success:
        # Draw bounding box
        p1 = (int(bbox[0]), int(bbox[1]))
        p2 = (int(bbox[0] + bbox[2]), int(bbox[1] + bbox[3]))
        cv2.rectangle(frame, p1, p2, (255, 0, 0), 2, 1)
    else:
        cv2.putText(frame, "Tracking failure detected", (100, 80), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 0, 255), 2)

    cv2.imshow('Tracking', frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

print("Advanced tracking completed")
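
The step that distinguishes a multi-object tracker such as SORT from the single-object tracker above is data association: deciding which new detection belongs to which existing track. The sketch below is a simplified stand-in, assuming greedy IoU matching in place of the Hungarian algorithm and using raw boxes in place of Kalman-predicted ones. The function names, the track ids, and the `0.3` threshold are hypothetical.

```python
def iou_xyxy(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_threshold=0.3):
    """Greedily match track ids to detection indices by descending IoU.

    tracks: dict mapping track id -> last known box
    detections: list of boxes from the current frame
    Returns (matches, unmatched_detection_indices).
    """
    pairs = sorted(
        ((iou_xyxy(tbox, dbox), tid, di)
         for tid, tbox in tracks.items()
         for di, dbox in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, di in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap too little
        if tid in matches or di in used:
            continue
        matches[tid] = di
        used.add(di)
    unmatched = [di for di in range(len(detections)) if di not in used]
    return matches, unmatched


tracks = {1: (10, 10, 50, 50), 2: (100, 100, 150, 150)}
detections = [(102, 101, 152, 151), (12, 11, 52, 51), (300, 300, 340, 340)]
matches, unmatched = associate(tracks, detections)
print("Matches:", matches, "New objects:", unmatched)
```

Unmatched detections would typically start new tracks, and tracks that stay unmatched for several frames would be dropped.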

### Additional Example: Vehicle Detection and Tracking

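This example shows the Haar cascade workflow applied to vehicle detection. A car cascade (`cars.xml`) has to be downloaded separately, so the cell uses the bundled frontal-face cascade as a stand-in and will not find vehicles until you swap in the car cascade.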

In [None]:
import cv2

# Load a Haar cascade for cars (must be downloaded separately)
# car_cascade = cv2.CascadeClassifier('cars.xml')

# Stand-in for demonstration: the bundled frontal-face cascade is used on a road image
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

# Load image
img = cv2.imread('datasets/sample_images/road.png')
if img is None:
    raise FileNotFoundError('Could not read datasets/sample_images/road.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect (simulating vehicle detection)
objects = face_cascade.detectMultiScale(gray, 1.1, 4)

# Draw rectangles
for (x, y, w, h) in objects:
    cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 2)

cv2.imshow('Vehicle Detection', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

print(f"Detected {len(objects)} objects")

## Advanced Level: Deep Learning and Custom Solutions

### Fine-tuning a Pre-trained Model

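The cell below loads torchvision's COCO-pretrained Faster R-CNN and runs inference on a single image. This is the starting point for fine-tuning: for a custom dataset you would replace the box predictor head and train on your own annotations.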

In [None]:
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms import functional as F
import cv2
import numpy as np

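# Load a COCO-pretrained model (torchvision >= 0.13 uses `weights`;
# older releases used the now-deprecated `pretrained=True`)
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()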

# Load image
img_path = 'datasets/sample_images/sample1.png'
img = cv2.imread(img_path)
if img is None:
    raise FileNotFoundError(f'Could not read {img_path}')
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# Transform image
img_tensor = F.to_tensor(img_rgb)

# Make prediction
with torch.no_grad():
    predictions = model([img_tensor])

# Process predictions
pred_boxes = predictions[0]['boxes'].cpu().numpy()
pred_scores = predictions[0]['scores'].cpu().numpy()
pred_labels = predictions[0]['labels'].cpu().numpy()

# Keep only predictions with confidence above 0.5
keep = pred_scores > 0.5
pred_boxes = pred_boxes[keep]
pred_labels = pred_labels[keep]
pred_scores = pred_scores[keep]

# COCO class names, indexed by the label ids the model returns ('N/A' marks unused ids)
COCO_CLASSES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 
    'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A',
    'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

# Draw bounding boxes
for box, label in zip(pred_boxes, pred_labels):
    x1, y1, x2, y2 = box.astype(int)
    cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(img, COCO_CLASSES[label], (x1, y1-10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

# Display
cv2.imshow('Faster R-CNN Detection', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

print("Advanced detection completed")

### Kalman Filter for Tracking

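A Kalman filter smooths noisy position measurements and predicts where an object will be in the next frame, which makes tracking more robust to missed or jittery detections. The example uses a constant-velocity model with state `[x, y, vx, vy]` and measurement `[x, y]`.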

In [None]:
import cv2
import numpy as np

# Simple Kalman Filter implementation
class KalmanFilter:
    def __init__(self):
        self.kalman = cv2.KalmanFilter(4, 2)
        self.kalman.measurementMatrix = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], np.float32)
        self.kalman.transitionMatrix = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]], np.float32)
        self.kalman.processNoiseCov = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], np.float32) * 0.03

    def predict(self):
        return self.kalman.predict()

    def correct(self, measurement):
        self.kalman.correct(np.array([[np.float32(measurement[0])], [np.float32(measurement[1])]]))

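# Example usage (simplified)
kf = KalmanFilter()

# Simulated measurements
measurements = [(100, 100), (105, 102), (110, 105), (115, 108)]

# Seed the state with the first measurement and a non-zero uncertainty;
# otherwise the filter starts at the origin and early predictions lag badly
x0, y0 = measurements[0]
kf.kalman.statePost = np.array([[x0], [y0], [0], [0]], np.float32)
kf.kalman.errorCovPost = np.eye(4, dtype=np.float32)

for measurement in measurements[1:]:
    prediction = kf.predict()
    kf.correct(measurement)
    print(f"Measurement: {measurement}, Prediction: {prediction.flatten()[:2]}")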

print("Kalman filter tracking example completed")
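
For readers who want to see what the predict and correct steps compute, the sketch below writes out the standard Kalman filter equations in NumPy for the same constant-velocity model. The function names `kf_predict` and `kf_update`, the initial covariance, and the measurement noise `R` are assumptions chosen for illustration, not values taken from OpenCV.

```python
import numpy as np

# State is [x, y, vx, vy]; measurement is [x, y]
F = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)  # constant-velocity transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)  # we observe position only
Q = np.eye(4) * 0.03                       # process noise (same scale as above)
R = np.eye(2)                              # measurement noise (assumed)


def kf_predict(x, P):
    """Propagate the state and its covariance one time step."""
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P


def kf_update(x, P, z):
    """Correct the prediction with measurement z."""
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P


# Start at the first measurement with zero velocity and large uncertainty
x = np.array([100.0, 100.0, 0.0, 0.0])
P = np.eye(4) * 10.0

for z in [(105, 102), (110, 105), (115, 108)]:
    x, P = kf_predict(x, P)
    x, P = kf_update(x, P, np.array(z, dtype=float))
    print("state estimate:", np.round(x, 2))
```

After a few measurements the velocity components become positive, which is what lets the filter extrapolate the object's position when a detection is missing.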

### Additional Example: Multi-Object Tracking with DeepSORT

For more advanced multi-object tracking, you can feed YOLO detections into DeepSORT, which extends SORT with an appearance embedding to help identities survive short occlusions.

In [None]:
# Note: DeepSORT requires additional setup. This is a conceptual example.

# Pseudocode for DeepSORT integration
# from deep_sort_realtime.deepsort_tracker import DeepSort

# Initialize DeepSORT tracker
# tracker = DeepSort(max_age=30, n_init=3, nms_max_overlap=1.0)

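# In the video processing loop (deep_sort_realtime expects each detection as a
# tuple of ([left, top, width, height], confidence, class_name)):
# detections = [(bbox, confidence, class_name) for bbox, confidence, class_name in detected_objects]
# tracks = tracker.update_tracks(detections, frame=frame)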
# for track in tracks:
#     if not track.is_confirmed():
#         continue
#     track_id = track.track_id
#     bbox = track.to_ltrb()
#     # Draw track

print("DeepSORT multi-object tracking concept explained")

## Evaluation and Metrics

### Common Metrics for Object Detection

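- IoU (Intersection over Union): the area where a predicted box and a ground-truth box overlap, divided by the area of their union
- Precision, Recall, F1-Score: the share of predictions that are correct, the share of ground-truth objects that are found, and the harmonic mean of the two
- mAP (mean Average Precision): the average precision (area under the precision-recall curve) averaged over classes, usually at one or more IoU thresholds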
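
The first two metrics are simple enough to compute by hand. The sketch below is a minimal illustration with hypothetical helper names; it takes boxes in `(x1, y1, x2, y2)` format and assumes the true-positive, false-positive, and false-negative counts have already been obtained by matching predictions to ground truth (commonly at an IoU threshold of 0.5).

```python
def iou_xyxy(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


overlap = iou_xyxy((0, 0, 10, 10), (5, 5, 15, 15))
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"IoU={overlap:.3f}, P={precision:.3f}, R={recall:.3f}, F1={f1:.3f}")
```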

### Tracking Metrics

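- MOTA (Multiple Object Tracking Accuracy): 1 minus the sum of false negatives, false positives, and identity switches, divided by the total number of ground-truth objects over all frames
- MOTP (Multiple Object Tracking Precision): the average localization quality of matched track and ground-truth pairs (reported as an error or an overlap, depending on the benchmark)
- IDF1 (ID F1 Score): an F1 score over identity-correct detections, which rewards keeping the same ID on the same object over time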
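
MOTA follows directly from error counts accumulated over a sequence. The sketch below assumes those counts are already available from a matching step; the function name `mota` and the example numbers are hypothetical.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


score = mota(false_negatives=10, false_positives=5, id_switches=2, num_gt=100)
print(f"MOTA = {score:.2f}")
```

Because the error count can exceed the number of ground-truth objects, MOTA can be negative for a very poor tracker.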

## Next Steps

1. Experiment with different datasets
2. Implement real-time processing
3. Add more advanced models like YOLOv5 or Detectron2
4. Integrate with edge devices for deployment
5. Explore 3D object detection and tracking

## Deliverables

- `inference.py` script for running the pipeline
- Annotated video in `results/` folder
- Complete notebook with examples