# Multiple Object Tracking in TensorFlow 2 with Tracktor

Tracktor is a simple but versatile multi-object tracking architecture that reuses a deep learning-based object detection model to achieve near state-of-the-art results "without bells and whistles".

Here, we showcase a minimal working implementation of Tracktor in TensorFlow 2 using the TensorFlow Object Detection library.

Tracktor was introduced in ["Tracking without bells and whistles"](https://arxiv.org/abs/1903.05625) by Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixé (ICCV 2019). The original PyTorch implementation is located [here](https://github.com/phil-bergmann/tracking_wo_bnw).

We showcase Tracktor on [this video](https://mixkit.co/free-stock-video/little-girl-plays-with-her-dog-on-the-grass-14276/) from Mixkit.


## Installs and Imports

In [None]:
import os
import pathlib

# Clones the tensorflow models repository if it doesn't already exist.
if "models" in pathlib.Path.cwd().parts:
  while "models" in pathlib.Path.cwd().parts:
    os.chdir("..")
elif not pathlib.Path("models").exists():
  !git clone --depth 1 https://github.com/tensorflow/models

In [None]:
%%bash
# Installs the Object Detection API.
cd models/research/
protoc object_detection/protos/*.proto --python_out=.
cp object_detection/packages/tf2/setup.py .
python -m pip install .

In [None]:
from copy import deepcopy
from glob import glob

import imageio
from IPython.display import display
from IPython.display import Image as IPyImage
import numpy as np
from PIL import Image
from io import BytesIO
import tensorflow as tf

from object_detection.utils.config_util import get_configs_from_pipeline_file
from object_detection.utils.label_map_util import create_category_index_from_labelmap
from object_detection.utils.visualization_utils import visualize_boxes_and_labels_on_image_array as visualize
from object_detection.builders import model_builder

In [None]:
%%bash
# Downloads pretrained Faster R-CNN model.
wget http://download.tensorflow.org/models/object_detection/tf2/20200711/faster_rcnn_resnet152_v1_1024x1024_coco17_tpu-8.tar.gz
tar -xzvf faster_rcnn_resnet152_v1_1024x1024_coco17_tpu-8.tar.gz
mv faster_rcnn_resnet152_v1_1024x1024_coco17_tpu-8/checkpoint models/research/object_detection/test_data/
rm faster_rcnn_resnet152_v1_1024x1024_coco17_tpu-8.tar.gz

## Load pretrained Faster R-CNN

In [None]:
# Sets model variables.
model_name = "faster_rcnn_resnet152_v1_1024x1024_coco17_tpu-8"
data_dir = "models/research/object_detection/test_data/"
model_dir = os.path.join(data_dir, "checkpoint")

# Loads model configuration.
pipeline_config = os.path.join("models/research/object_detection/configs/tf2/",
                                model_name + ".config")
configs = get_configs_from_pipeline_file(pipeline_config)
model_config = configs["model"]

# Builds model (a Faster R-CNN Meta-Arch object) using config.
model = model_builder.build(model_config=model_config, is_training=False)

# Loads model checkpoint.
ckpt = tf.train.Checkpoint(model=model)
ckpt.restore(os.path.join(model_dir, "ckpt-0")).expect_partial()

In [None]:
# Loads category index using COCO label map.
PATH_TO_LABELS = "models/research/object_detection/data/mscoco_label_map.pbtxt"
category_index = create_category_index_from_labelmap(PATH_TO_LABELS,
                                                     use_display_name=True)

In [None]:
# Defines a function to load an image at half resolution as a numpy array.
def image_to_np(path):
    # Converts to RGB so the array always has exactly three channels.
    image = Image.open(path).convert("RGB")
    width, height = image.size
    image = image.resize((width // 2, height // 2))
    # The resulting array has shape (height // 2, width // 2, 3).
    return np.array(image)

In [None]:
%%bash
# Downloads images from GCP.
mkdir -p "models/research/object_detection/test_images/tracktor_images"
cd "models/research/object_detection/test_images/tracktor_images"
wget https://storage.googleapis.com/object-detection-dogfood/data/tracktor_images.tar.gz
tar -xzvf "tracktor_images.tar.gz"
rm "tracktor_images.tar.gz"

In [None]:
# Converts input images to numpy.
images = []
image_paths = "models/research/object_detection/test_images/tracktor_images/*.png"
for path in sorted(glob(image_paths)):
    image = image_to_np(path)
    images.append(image)

In [None]:
# Defines Track class.
class Track:
    def __init__(self, id, box, cls, score):
        self.id = id
        self.box = box
        self.cls = cls
        self.score = score
    def __repr__(self):
        return str(self.id)

## Object Detection and Tracking with Tracktor

The basic concept of Tracktor is to repurpose the regression head of the object detection model to regress the previous frame's bounding boxes onto the current frame's features. This allows us to associate bounding boxes between frames without an additional tracking model (hence, "without bells and whistles").
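
To make "regressing a box" concrete, here is a minimal sketch of how a Faster R-CNN-style regression head moves a box: it predicts four deltas that shift the box center and rescale its height and width. The function name `apply_box_deltas` and the example deltas are hypothetical, and real box coders may additionally scale the deltas by configured factors, which this sketch omits.

```python
import math


def apply_box_deltas(box, deltas):
    """Shifts and rescales a [ymin, xmin, ymax, xmax] box by (ty, tx, th, tw)."""
    ymin, xmin, ymax, xmax = box
    ty, tx, th, tw = deltas
    height, width = ymax - ymin, xmax - xmin
    ycenter, xcenter = ymin + height / 2.0, xmin + width / 2.0

    # Center offsets are relative to the box size; sizes change by exp(delta).
    new_ycenter = ycenter + ty * height
    new_xcenter = xcenter + tx * width
    new_height = height * math.exp(th)
    new_width = width * math.exp(tw)

    return [new_ycenter - new_height / 2.0,
            new_xcenter - new_width / 2.0,
            new_ycenter + new_height / 2.0,
            new_xcenter + new_width / 2.0]


# A previous-frame track box in normalized coordinates.
prev_box = [0.2, 0.2, 0.6, 0.4]
# Hypothetical deltas: move right by half the box width, keep the size.
new_box = apply_box_deltas(prev_box, [0.0, 0.5, 0.0, 0.0])
print(new_box)
```

In Tracktor, the deltas come from the detector's box head evaluated on the current frame's features at the previous frame's box location, so the updated box keeps the identity of the track it started from.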

Our implementation does not include reidentification, as it often requires training a separate neural network (such as a Siamese network). Thus, we follow a 4-step process:


1. Run detection using the Faster R-CNN.
2. Regress previous-frame bounding boxes onto current-frame features; update or remove tracks depending on regression confidence.
3. Run non-maximum suppression to remove detection boxes which are already covered by tracks.
4. Instantiate new tracks.
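
Step 3 can be illustrated in plain Python. The sketch below is a simplified stand-in for non-maximum suppression, not the notebook's implementation: it only drops detections whose IoU with any active track exceeds a threshold and ignores overlaps between detections themselves. The helper names `iou` and `drop_covered_detections` and the example boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two [ymin, xmin, ymax, xmax] boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_covered_detections(det_boxes, track_boxes, iou_thresh):
    """Keeps only detections that no active track already covers."""
    return [det for det in det_boxes
            if all(iou(det, trk) <= iou_thresh for trk in track_boxes)]


track_boxes = [[0.1, 0.1, 0.5, 0.5]]
det_boxes = [
    [0.12, 0.1, 0.5, 0.5],  # Near-duplicate of the existing track.
    [0.6, 0.6, 0.9, 0.9],   # Object not covered by any track.
]
new_boxes = drop_covered_detections(det_boxes, track_boxes, iou_thresh=0.3)
print(new_boxes)
```

Only the detections that survive this filter go on to step 4 and become new tracks, which prevents an object that is already tracked from receiving a second ID.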

Finally, it's worth noting the main weakness of Tracktor: it needs a high frame rate to perform at its best, because the regression only works well when objects move very little between consecutive frames. One can partly compensate for this weakness with a separate motion model, which we have not included here.
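
As an illustration of what such a motion model could look like, here is a minimal sketch of a constant-velocity assumption: each box is extrapolated linearly from its last two positions before being regressed. This is not part of the notebook, and the function `predict_box` and its arguments are hypothetical.

```python
def predict_box(prev_box, curr_box, frames_ahead=1):
    """Linearly extrapolates a normalized [ymin, xmin, ymax, xmax] box."""
    predicted = []
    for prev, curr in zip(prev_box, curr_box):
        # Velocity is the per-frame change of each coordinate.
        value = curr + frames_ahead * (curr - prev)
        # Clips to the image, since coordinates are normalized to [0, 1].
        predicted.append(min(1.0, max(0.0, value)))
    return predicted


box_t0 = [0.20, 0.20, 0.60, 0.40]  # Track box two frames ago.
box_t1 = [0.20, 0.25, 0.60, 0.45]  # Track box in the previous frame.
predicted = predict_box(box_t0, box_t1)
print(predicted)
```

Under this assumption, the extrapolated box rather than the raw previous-frame box would be passed to the regression step, giving the regressor a starting point closer to the object's new position at low frame rates.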


In [None]:
# Sets threshold variables.
# DET_CONF_THRESH is the minimum score to keep a detection.
# REG_CONF_THRESH is the minimum score to keep a regressed track.
# NMS_IOU_THRESH is the IOU above which NMS removes an overlapping box.
# NMS_SCORE_THRESH is the score below which NMS removes a box.
# LABEL_ID_OFFSET is the difference between 0 and the first class index.
DET_CONF_THRESH = 0.8
REG_CONF_THRESH = 0.15
NMS_IOU_THRESH = 0.3
NMS_SCORE_THRESH = 0.3
LABEL_ID_OFFSET = 1

tracks = []
dead_tracks = []
tracks_per_frame = []
for num, image in enumerate(images):
    print(f"Frame {num + 1}/{len(images)}")
    # Converts numpy array to Tensor and adds batch dimension.
    input_tensor = tf.convert_to_tensor(image, dtype=tf.float32)
    input_tensor = tf.expand_dims(input_tensor, axis=0)

    # Runs detection model on image.
    inputs, true_shapes = model.preprocess(input_tensor)
    pred = model.predict(inputs, true_shapes)
    det = model.postprocess(pred, true_shapes)

    # Extracts boxes, classes, and scores from detection output.
    boxes = det["detection_boxes"][0].numpy()
    classes = det["detection_classes"][0].numpy()
    scores = det["detection_scores"][0].numpy()

    # Thresholds detection output.
    keep = scores >= DET_CONF_THRESH
    boxes = boxes[keep]
    classes = classes[keep]
    scores = scores[keep]

    # Regresses previous frame tracks onto current frame features.
    # This leverages the Faster R-CNN architecture to update track
    # position without extra tracking "bells and whistles".
    if tracks:
        # Sets the current number of tracks.
        num_proposals = len(tracks)

        # Extracts features from detection.
        feats = pred["rpn_features_to_crop"]
        preprocessed_shapes = pred["image_shape"]

        # Converts tracks to Tensor and pads to model.max_num_proposals.
        # Note that if the detection model resizes the image, you will need
        # to adjust the bounding box coordinates accordingly.
        tracks_boxes = tf.convert_to_tensor([track.box for track in tracks],
                                            dtype=tf.float32)
        padding = ((0, model.max_num_proposals - num_proposals), (0, 0))
        tracks_boxes = tf.pad(tracks_boxes, padding)
        tracks_boxes = tf.expand_dims(tracks_boxes, axis=0)

        # Runs box_prediction to regress boxes onto current features.
        # Adds extra fields to the output and postprocesses.
        box_pred = model._box_prediction(feats,
                                         tracks_boxes,
                                         preprocessed_shapes,
                                         true_shapes)
        box_pred["num_proposals"] = tf.convert_to_tensor([num_proposals],
                                                         dtype=tf.int32)
        box_pred["rpn_features_to_crop"] = feats
        reg = model.postprocess(box_pred, true_shapes)

        # Extracts boxes, classes, and scores from regression output.
        reg_boxes = reg["detection_boxes"][0].numpy()
        reg_classes = reg["detection_classes"][0].numpy()
        reg_scores = reg["detection_scores"][0].numpy()
        indices = reg["detection_anchor_indices"][0].numpy()

        # Gets highest-confidence regression for each track.
        tracks_to_update = set(range(num_proposals))
        to_delete = []
        for i, j in enumerate(indices):
            if j in tracks_to_update:
                track = tracks[j]
                # If confidence is low, delete the track.
                if reg_scores[i] < REG_CONF_THRESH:
                    to_delete.append(track)
                # If confidence is high, update the track.
                else:
                    track.box = reg_boxes[i]
                    track.cls = reg_classes[i]
                    track.score = reg_scores[i]
                tracks_to_update.remove(j)

        # If regression returned nothing for a track, delete the track.
        for j in tracks_to_update:
            to_delete.append(tracks[j])

        # Deletes tracks with no high-confidence regression in current frame.
        for track in to_delete:
            tracks.remove(track)
            dead_tracks.append(track)
    
    # Filters boxes which are already covered by tracks.
    if tracks and boxes.size != 0:
        # Sets variables for non-maximum suppression (NMS).
        # We let all the current tracks have a score of 2 so they are never
        # removed. This will remove the boxes which most overlap with
        # current tracks.
        nms_boxes = np.concatenate((boxes, [track.box for track in tracks]))
        nms_scores = np.concatenate((scores, [2. for track in tracks]))
        max_output_size = len(nms_scores)

        # Runs NMS to find which boxes to keep.
        keep = tf.image.non_max_suppression(nms_boxes,
                                            nms_scores,
                                            max_output_size,
                                            iou_threshold=NMS_IOU_THRESH,
                                            score_threshold=NMS_SCORE_THRESH)
        keep = keep.numpy()

        # Finds which boxes to remove.
        to_delete = []
        for i, _ in enumerate(boxes):
            if i not in keep:
                to_delete.append(i)
    
        # Deletes boxes.
        boxes = np.delete(boxes, to_delete, 0)
        classes = np.delete(classes, to_delete)
        scores = np.delete(scores, to_delete, 0)
    
    # Reidentification would go here.

    # Instantiates new tracks.
    for box, cls, score in zip(boxes, classes, scores):
        id = len(tracks) + len(dead_tracks)
        track = Track(id, box, cls, score)
        tracks.append(track)

    # Updates tracks_per_frame with this frame's results.
    tracks_per_frame.append(deepcopy(tracks))

## Visualization

Failure cases are mostly due to using the pretrained weights without fine-tuning. There aren't many flying dogs in COCO-17!

In [None]:
vis_images = deepcopy(images)
# Draws tracking boxes on a copy of each image.
for frame_tracks, vis_image in zip(tracks_per_frame, vis_images):
    # Gets tracking information for the image.
    ids = np.array([track.id for track in frame_tracks])
    boxes = np.array([track.box for track in frame_tracks])
    classes = np.array([track.cls + LABEL_ID_OFFSET for track in frame_tracks])
    scores = np.array([track.score for track in frame_tracks])

    # Plots tracking boxes on the image.
    visualize(
        vis_image,
        boxes,
        classes,
        scores,
        category_index,
        track_ids=ids,
        min_score_thresh=REG_CONF_THRESH,
        line_thickness=2,
        use_normalized_coordinates=True)

# Saves and displays results as a GIF.
imageio.plugins.freeimage.download()
gif_name = 'tracktor.gif'
imageio.mimsave(gif_name, vis_images, 'GIF-FI', fps=15)
with open(gif_name, 'rb') as gif_file:
    display(IPyImage(gif_file.read()))