<a href="https://colab.research.google.com/github/youssefokeil/HiFT-reimplementation/blob/master/HiFT_reimplementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hierarchical Feature Transformer Reimplementation

In this notebook, we'll *recreate* in PyTorch the visual tracking method outlined in the paper [HiFT: Hierarchical Feature Transformer for Aerial Tracking](https://ieeexplore.ieee.org/document/9710895).

The tracker in this paper uses features extracted from an AlexNet, which consists of a series of convolutional and pooling layers followed by a few fully-connected layers.

<img src='https://github.com/youssefokeil/HiFT-reimplementation/blob/master/Notebook_Images/HiFT-overview.png?raw=true' width=80% />

### The Hierarchical Feature Transformer

The transformer takes the features extracted by AlexNet and performs high-resolution feature encoding and low-resolution feature decoding. The high-resolution encoder learns interdependencies between different feature layers and draws attention to objects at different scales. The low-resolution decoder aggregates semantic information from the low-resolution feature map.

The model then uses a regression network and a classification network as its head. The overview figure above shows how these pieces fit together.

We'll use the requirements specified in the GitHub repo of the research group.

In [1]:
!pip install opencv-python yacs tqdm colorama cython;

Collecting yacs
  Downloading yacs-0.1.8-py3-none-any.whl.metadata (639 bytes)
Collecting colorama
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading yacs-0.1.8-py3-none-any.whl (14 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: yacs, colorama
Successfully installed colorama-0.4.6 yacs-0.1.8


Note to self: the HiFT requirements include pyyaml, which looks outdated, and I'm not sure we need it.

In [2]:
# import resources
%matplotlib inline

# import pyyaml
import yacs
import tqdm
import colorama
import cython
#import tensorboardX
import matplotlib.pyplot as plt
import numpy as np


import torch
import torch.optim as optim
import torch.nn as nn
import requests
from torchvision import transforms, models

## The AlexNet Backbone

AlexNet implementations are easy to find, but we'll write our own, mostly for learning purposes. Having the configuration in front of us also makes it easier to change the model and try different architectures.

Take a look at the AlexNet architecture below. We only need the convolutional and pooling layers. Following the *HiFT* source code, we may also add batchnorm layers between the convolutional stacks.

<img src='https://upload.wikimedia.org/wikipedia/commons/c/cc/Comparison_image_neural_networks.svg' width=80% />


In [3]:
class AlexNet(nn.Module):
  def __init__(self):
      super(AlexNet, self).__init__()

      ## start without batchnorm; we'll try adding it later

      self.layer1 = nn.Sequential(
          nn.Conv2d(3,96,kernel_size=11, stride=2),
          nn.MaxPool2d(kernel_size=3, stride=2),
          nn.ReLU()
      )

      self.layer2 = nn.Sequential(
          nn.Conv2d(96, 256, kernel_size=5, stride=2),
          nn.MaxPool2d(kernel_size=3, stride=2),
          nn.ReLU()
      )

      self.layer3 = nn.Sequential(
          nn.Conv2d(256, 384, kernel_size=3, stride=1),
          nn.ReLU()
      )

      self.layer4 = nn.Sequential(
          nn.Conv2d(384, 384, kernel_size=3, stride=1),
          nn.ReLU()
      )

      self.layer5 = nn.Sequential(
          nn.Conv2d(384, 256, kernel_size=3, stride=1)
      )

  def forward(self, x):
      # pass the input through the five convolutional stacks in order
      x = self.layer1(x)
      x = self.layer2(x)
      x = self.layer3(x)
      x = self.layer4(x)
      x = self.layer5(x)

      return x
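
Before wiring this backbone into the rest of the tracker, it helps to know how large the final feature map will be. The sketch below applies the standard output-size formula for unpadded convolution and pooling layers, `floor((n + 2p - k) / s) + 1`, to the kernel sizes and strides used in the class above. The helper names (`out_size`, `backbone_size`) and the input sizes 287 and 127 are assumptions chosen for illustration, not values taken from this notebook.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial size after one conv or pooling layer (floor division)."""
    return (n + 2 * padding - kernel) // stride + 1


# (kernel, stride) pairs in the order the backbone above applies them
BACKBONE = [
    (11, 2), (3, 2),  # layer1: conv + max-pool
    (5, 2), (3, 2),   # layer2: conv + max-pool
    (3, 1),           # layer3: conv
    (3, 1),           # layer4: conv
    (3, 1),           # layer5: conv
]


def backbone_size(n):
    """Side length of the layer5 feature map for a square n x n input."""
    for kernel, stride in BACKBONE:
        n = out_size(n, kernel, stride)
    return n


# hypothetical input sizes, used only to illustrate the arithmetic
for size in (287, 127):
    print(size, "->", backbone_size(size))
```

With these strides, a small input can shrink to an empty feature map before it reaches layer 5, so it is worth running this check for every input size you plan to feed the network.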

Check which device you're running on. For training, it should be `cuda`.

In [4]:
# move the model to GPU, if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

device

device(type='cpu')

### Transformer Implementation

The paper uses a modified transformer, so I either need to understand transformers well enough to reimplement it or just copy and paste the code. I'll choose the former.
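
As a starting point for that understanding, here is a minimal sketch of plain scaled dot-product attention, the building block every transformer layer shares. It is written in pure Python so each step of the arithmetic is visible. This is the vanilla operation, not the modified HiFT variant, and the function names are my own.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    queries: list of vectors of length d
    keys:    list of vectors of length d
    values:  list of vectors, one per key
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # weighted average of the value vectors
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# two identical keys get equal weight, so the output is the mean of the values
print(attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]]))
```

In a PyTorch version the same computation would operate on batched tensors, with queries, keys, and values produced by learned linear projections of the backbone features.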

In [6]:
# TODOS: Hierarchical Transformer definition

## Regression (Localization) & Classification Network Head

Now we define the regression and classification heads. Their roles are:


*   **The regression network:** predicts where the object is located. It does this by outputting 4 values that define the bounding box parameters.
*   **The classification network:** determines whether the object is present at each location. It outputs a confidence score for the presence of the object.

Successful tracking needs both a high classification score and accurate localization.

In the paper's implementation, they use two classification branches:


> To achieve accurate classification, we apply two classification branches. One branch aims to classify via the area involved in the ground truth box. The other branch focuses on determining the positive samples measured by the distance between the center of ground truth and the corresponding point.

*HiFT 2021*
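
To make the two branches concrete, the sketch below builds one plausible pair of label maps on a small grid: the first marks every point inside the ground truth box, the second marks points within a fixed distance of the box center. This is my reading of the quote, not the paper's exact labeling rule. The function name `make_labels`, the grid coordinates, and the `radius` threshold are all hypothetical.

```python
import math


def make_labels(grid, box, radius):
    """Build two binary label maps on a grid x grid lattice.

    box    = (x1, y1, x2, y2) in grid coordinates (assumed convention)
    radius = hypothetical distance threshold for the second branch
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    area_label = [[0] * grid for _ in range(grid)]
    center_label = [[0] * grid for _ in range(grid)]
    for row in range(grid):
        for col in range(grid):
            # branch 1: positive if the point lies inside the ground truth box
            if x1 <= col <= x2 and y1 <= row <= y2:
                area_label[row][col] = 1
            # branch 2: positive if the point is close to the box center
            if math.hypot(col - cx, row - cy) <= radius:
                center_label[row][col] = 1
    return area_label, center_label


area, center = make_labels(grid=5, box=(1, 1, 3, 3), radius=1.0)
for row in area:
    print(row)
```

The second map is stricter than the first, so only points near the middle of the object count as positives for that branch.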



In [5]:
class ModelHead(nn.Module):
  def __init__(self):
      super(ModelHead, self).__init__()

      # uses "same convolution" for spatial awareness & to learn some edges
      self.conv_layer1=nn.Sequential(
          nn.Conv2d(192, 192, kernel_size=3, stride=1, padding=1),
          nn.BatchNorm2d(192),
          nn.ReLU()
      )

      # outputs 4 values per location corresponding to the bounding box
      self.layer_loc = nn.Conv2d(192, 4, kernel_size=3, stride=1, padding=1)

      self.cls1 = nn.Conv2d(192, 2, kernel_size=3, stride=1, padding=1)
      self.cls2 = nn.Conv2d(192, 1, kernel_size=3, stride=1, padding=1)

  def forward(self, x):

        # conv_layer1 can be applied multiple times; the paper's implementation
        # applies it 3 times. We can experiment with different counts later,
        # for now we apply it only once
        loc = self.conv_layer1(x)
        loc = self.layer_loc(loc)

        # the classification branches reuse the same conv layer
        cls = self.conv_layer1(x)

        # first classification branch
        cls1=self.cls1(cls)

        #second classification branch
        cls2=self.cls2(cls)


        return loc, cls1, cls2

## Loss Function

The model produces three outputs, so the overall loss is a weighted sum of three terms:

$$L_{overall} = \lambda_1 L_{cls1} + \lambda_2 L_{cls2} + \lambda_3 L_{loc}$$

> where $L_{cls1}$, $L_{cls2}$, $L_{loc}$ represent the cross-entropy, binary cross-entropy, and IoU loss. $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the coefficients to balance the contributions of each loss
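
The weighted sum can be worked through for a single location with plain Python. The sketch below is illustrative only: it assumes the IoU loss takes the form `1 - IoU` (the paper may use a different form), assumes boxes are given as `(x1, y1, x2, y2)`, and uses placeholder weights of 1.0. All function names are my own.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one location; target is the true class index."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def binary_cross_entropy(logit, target):
    """Binary cross-entropy on one logit; target is 0.0 or 1.0."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def iou_loss(pred, gt):
    """1 - IoU for two boxes (x1, y1, x2, y2); this loss form is an assumption."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return 1.0 - inter / (area_pred + area_gt - inter)


def overall_loss(l_cls1, l_cls2, l_loc, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the default weights are placeholders."""
    lam1, lam2, lam3 = lambdas
    return lam1 * l_cls1 + lam2 * l_cls2 + lam3 * l_loc


l_cls1 = cross_entropy([0.2, 1.5], target=1)
l_cls2 = binary_cross_entropy(0.8, target=1.0)
l_loc = iou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0))
print(overall_loss(l_cls1, l_cls2, l_loc))
```

In the actual training loop, each term would be averaged over all locations in the batch before the weights are applied.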


## Putting it all Together

Now let's put all the elements together and train our model.