# CSE527 Final - Problem 6 (Visual Object Tracking, 25 points)
**Due date: 23:59 on May 18, 2021 (Tuesday)**

**NO LATE SUBMISSION is allowed for Final**

---
This semester we use Google Colab for the assignments, which gives us access to resources, such as GPUs, that some of us may not have on our local machines. You will need your Stony Brook (*.stonybrook.edu) account for coding and Google Drive to save your results.

## Google Colab Tutorial
---
Go to https://colab.research.google.com/notebooks/ and open the tutorial notebook named "Welcome to Colaboratory", where you can learn the basics of using Google Colab.

Settings used for assignments: ***Edit -> Notebook Settings -> Runtime Type (Python 3)***.


## Local Machine Prerequisites
---
Since we are using Google Colab, all the code runs in a server environment where many libraries and packages are already installed. If a library is missing, or if you want to install everything on your local machine, use the installation links below.
* **Install Python 3.6**: https://www.python.org/downloads/ or use Anaconda (a Python distribution) at https://docs.continuum.io/anaconda/install/. Below are some materials and tutorials which you may find useful for learning Python if you are new to Python.
  - https://docs.python.org/3.6/tutorial/index.html
  - https://www.learnpython.org/
  - http://docs.opencv.org/3.0-beta/doc/py_tutorials/py_tutorials.html
  - http://www.scipy-lectures.org/advanced/image_processing/index.html


* **Install Python packages**: install the Python packages `numpy`, `matplotlib`, and `opencv-python` using pip, for example:
```
pip install numpy matplotlib opencv-python
``` 
	Note that when using `pip install`, make sure the pip you are running belongs to Python 3. Any one of the commands below shows which Python version pip uses on your machine:
  
```  
    pip show pip
    pip --version
    pip -V
```

If pip points to the wrong version, use `pip3` to target Python 3 explicitly.

* **Install Jupyter Notebook**: follow the instructions at http://jupyter.org/install.html to install Jupyter Notebook and familiarize yourself  with it. *After you have installed Python and Jupyter Notebook, please open the notebook file 'HW1.ipynb' with your Jupyter Notebook and do your homework there.*
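
After installing the packages listed above, it can help to confirm that the interpreter you are running can actually find them. The snippet below is a minimal sketch of such a check, not part of the assignment code; the helper name `missing_modules` is made up for illustration. Note that the import name can differ from the pip name (`opencv-python` is imported as `cv2`).

```python
import importlib.util
import sys


def missing_modules(module_names):
    """Return the names in module_names that the import system cannot find."""
    # find_spec returns None for a top-level module that is not installed
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # import names, not pip names: opencv-python is imported as cv2
    required = ["numpy", "matplotlib", "cv2"]
    print("Python version:", sys.version.split()[0])
    print("Missing modules:", missing_modules(required))
```

An empty list means every module was found by the interpreter that ran the check.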

---
## Tracking algorithm description

This problem is based on SiamFC [1]. The basic idea of [1] is to learn a similarity measure using a deep Siamese network. The learned similarity measure is then used for both target localization and target scale estimation. In addition, no model update is performed during tracking. **More details can be found in the original paper [1].**
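
To make the similarity idea concrete, here is a minimal NumPy sketch of cross-correlating an exemplar feature map with a larger search feature map and reading the target location off the peak of the score map. It is an illustration under simplifying assumptions, not the network defined in *SiamNet.py*: random values stand in for learned embeddings, the search map is zero everywhere except where the exemplar is planted, and the function name `cross_correlate` is made up for this example.

```python
import numpy as np


def cross_correlate(exemplar_feat, search_feat):
    """Slide exemplar_feat (C, h, w) over search_feat (C, H, W).

    Returns a score map of shape (H - h + 1, W - w + 1) whose entry (i, j)
    is the inner product between the exemplar and the window at (i, j).
    """
    _, h, w = exemplar_feat.shape
    _, H, W = search_feat.shape
    scores = np.zeros((H - h + 1, W - w + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            window = search_feat[:, i:i + h, j:j + w]
            scores[i, j] = np.sum(window * exemplar_feat)
    return scores


# toy features: random values stand in for the output of the embedding network
rng = np.random.RandomState(0)
exemplar_feat = rng.randn(4, 4, 4)

# plant the exemplar in an otherwise empty search map at row 3, column 5
search_feat = np.zeros((4, 10, 10))
search_feat[:, 3:7, 5:9] = exemplar_feat

scores = cross_correlate(exemplar_feat, search_feat)
peak = np.unravel_index(np.argmax(scores), scores.shape)
print("score map shape:", scores.shape)
print("peak (row, col):", peak)
```

The peak of the score map falls where the exemplar was planted. In the tracker, the same kind of score map is computed on learned features for several scaled crops, and its peak gives the displacement of the target.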

## Your task

In this assignment, the SiamFC tracking algorithm is provided with several parts left incomplete (e.g., the function that loads the sequence for tracking). You need to complete these parts and make the full tracking algorithm work. The incomplete parts look like this:

\# description of requirement for this part

\#------------- WRITE YOUR OWN CODE -------------\#

...

\#------------- END OF YOUR OWN CODE-------------\#

**NOTE: the pretrained SiamFC model is given, and you just need to use the pretrained model to finish the tracking algorithm.**

[1] Fully-Convolutional Siamese Networks for Object Tracking, ECCVW, 2016.



# Environment setup and data preparation



1.   **Install PyTorch 1.1.0 and Pillow 6.0.0.**

  Please install PyTorch 1.1.0 using the following command (by default, the latest version of PyTorch is installed on Colab; here, we want to use an older version). In addition, you also need to downgrade Pillow to version 6.0.0. 
  
  **After finishing, remember to restart runtime, otherwise you may get errors.**




In [1]:
# install pytorch
!pip install torch===1.1.0 torchvision==0.4.0 -f https://download.pytorch.org/whl/torch_stable.html

# downgrade pillow
!pip install pillow==6.0.0

# after executing the above commands, remember to restart runtime

Looking in links: https://download.pytorch.org/whl/torch_stable.html


2.   **Download data and model**

  link for data: https://drive.google.com/open?id=1GH1mDaS6vWXcypbXxqjVxtcQM0XY28lS;

  link for model: https://drive.google.com/open?id=1XhmZWFxth4JDbf_Pd_qezEVRu-D_UPZ8;

  link for utility codes: https://drive.google.com/open?id=1lUh8GO-wX0rxaxJGLefpute7_YK1Sz7H and https://drive.google.com/open?id=1A_KbyWcOpl-6cEW7zP0Ub-fE1dVjZSgE;

  **You need to manually download the above files to your local machine, and then use the following command to upload all of them.**

In [2]:
from google.colab import files
uploaded = files.upload()

Saving Config.py to Config (1).py
Saving SiamFCMModel.pth to SiamFCMModel (1).pth
Saving SiamFCVideo.zip to SiamFCVideo.zip
Saving SiamNet.py to SiamNet.py


3. **Unzip the video sequence using the following command.**

In [3]:
# unzip video data from the zip file
!unzip SiamFCVideo.zip

Archive:  SiamFCVideo.zip
   creating: SiamFCVideo/
  inflating: SiamFCVideo/groundtruth_rect.txt  
   creating: SiamFCVideo/img/
  inflating: SiamFCVideo/img/0001.jpg  
  inflating: SiamFCVideo/img/0002.jpg  
  inflating: SiamFCVideo/img/0004.jpg  
  inflating: SiamFCVideo/img/0005.jpg  
  inflating: SiamFCVideo/img/0006.jpg  
  inflating: SiamFCVideo/img/0008.jpg  
  inflating: SiamFCVideo/img/0009.jpg  
  inflating: SiamFCVideo/img/0010.jpg  
  inflating: SiamFCVideo/img/0011.jpg  
  inflating: SiamFCVideo/img/0012.jpg  
  inflating: SiamFCVideo/img/0013.jpg  
  inflating: SiamFCVideo/img/0014.jpg  
  inflating: SiamFCVideo/img/0016.jpg  
  inflating: SiamFCVideo/img/0017.jpg  
  inflating: SiamFCVideo/img/0018.jpg  
  inflating: SiamFCVideo/img/0019.jpg  
  inflating: SiamFCVideo/img/0021.jpg  
  inflating: SiamFCVideo/img/0023.jpg  
  inflating: SiamFCVideo/img/0024.jpg  
  inflating: SiamFCVideo/img/0026.jpg  
  inflating: SiamFCVideo/img/0028.jpg  
  inflating: SiamFCVideo/img/0

In [4]:
# check that the files were uploaded successfully
# should show: Config.py, SiamFCMModel.pth, SiamFCVideo, SiamFCVideo.zip, SiamNet.py
!ls

'Config (1).py'   sample_data		  SiamFCMModel.pth   SiamFCVideo.zip
 Config.py	 'SiamFCMModel (1).pth'   SiamFCVideo	     SiamNet.py
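
The same check can be done from Python instead of reading the `ls` output by eye. The snippet below is a small sketch, with a made-up helper name `find_missing`; the expected names are the ones used by the upload and unzip steps above and by the tracking code (which imports `Config` and loads `./SiamFCMModel.pth`).

```python
import os


def find_missing(expected_names, root="."):
    """Return the entries of expected_names that do not exist under root."""
    return [name for name in expected_names
            if not os.path.exists(os.path.join(root, name))]


# names taken from the upload and unzip steps
EXPECTED = ["Config.py", "SiamNet.py", "SiamFCMModel.pth", "SiamFCVideo"]

if __name__ == "__main__":
    missing = find_missing(EXPECTED)
    print("All files found." if not missing else "Missing: %s" % missing)
```

If a file was uploaded twice, Colab saves the second copy under a name such as `Config (1).py`; only the plain names listed above are looked up here.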


# SiamFC Tracking Algorithm

**Some important instructions:**


*   After preparing the data, it is better to check each file;
*   The SiamFC network is defined in *SiamNet.py*; read it to get familiar with the architecture of SiamFC;
*   Some parameters are defined in *Config.py*; you do NOT need to worry about them;


**NOTE 1: before running the code, please change runtime type to GPU**

**NOTE 2: There are many resources about SiamFC available online, but you should finish this assignment on your own. Do NOT copy-and-paste!!!**

In [6]:
# Code for SiamFC tracking algorithm
import torchvision.transforms.functional as F
import cv2
from torch.autograd import Variable
import torch
import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import glob
from Config import *
from IPython.display import clear_output
import copy
import math

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def cat_img(image_cat1, image_cat2, image_cat3):
    """
    concatenate three 1-channel images to one 3-channel image
    """
    image = np.zeros(shape = (image_cat1.shape[0], image_cat1.shape[1], 3), dtype=np.double)
    image[:, :, 0] = image_cat1
    image[:, :, 1] = image_cat2
    image[:, :, 2] = image_cat3

    return image


def load_sequence(seq_path):
    """
    load a sequence;
    the sequence should be in OTB format, or you can customize this function yourself
    """
    # given the sequence path, you need to load sequence information, including
    # img_list: a list, each element of which is the path for an image
    # target_position: an array, the center position of the target [center_y, center_x]
    # target_sz: an array, the height and width of the target [h, w]
    # NOTE: the sequence folder contains a sub-folder that contains all image files, 
    # and a 'groundtruth_rect.txt' that contains the initial groundtruth of the target in format (top_left_x, top_left_y, width, height)

    # ------ WRITE YOUR OWN CODE (5 points) ------#
    # collect the frame paths from the image sub-folder; sort them so frames are
    # processed in temporal order (glob does not guarantee any ordering)
    img_list = sorted(glob.glob(os.path.join(seq_path, '*/*.jpg')))

    ground_truth = os.path.join(seq_path, 'groundtruth_rect.txt')

    # the first line holds the initial box: top_left_x, top_left_y, width, height
    with open(ground_truth, 'r') as f:
        x, y, w, h = [int(v) for v in f.readline().strip().split(',')]

    # size as [h, w] and center as [center_y, center_x]
    target_sz = np.array([h, w])
    target_position = np.array([y + round(h / 2), x + round(w / 2)])
    # ------------ END OF YOUR OWN CODE ------------#

    return img_list, target_position, target_sz


def visualize_tracking_result(img, bbox, fig_n):
    """
    visualize tracking result
    """
    fig = plt.figure(fig_n)
    ax = plt.Axes(fig, [0., 0., 1., 1.])
    ax.set_axis_off()
    fig.add_axes(ax)
    r = patches.Rectangle((bbox[0], bbox[1]), bbox[2], bbox[3], linewidth = 3, edgecolor = "#00ff00", zorder = 1, fill = False)
    ax.imshow(img)
    ax.add_patch(r)
    plt.ion()
    plt.show()
    plt.pause(0.00001)
    plt.clf()


def get_subwindow_tracking(im, pos, model_sz, original_sz, avg_chans):
    """
    extract image crop
    """
    if original_sz is None:
        original_sz = model_sz

    sz = original_sz
    im_sz = im.shape
    # make sure the size is not too small
    assert (im_sz[0] > 2) & (im_sz[1] > 2), "The size of image is too small!"
    c = (sz+1) / 2

    # check out-of-bounds coordinates, and set them to black
    context_xmin = round(pos[1] - c)       # floor(pos(2) - sz(2) / 2);
    context_xmax = context_xmin + sz - 1
    context_ymin = round(pos[0] - c)       # floor(pos(1) - sz(1) / 2);
    context_ymax = context_ymin + sz - 1
    left_pad = max(0, 1-context_xmin)       # in python, index starts from 0
    top_pad = max(0, 1-context_ymin)
    right_pad = max(0, context_xmax - im_sz[1])
    bottom_pad = max(0, context_ymax - im_sz[0])

    context_xmin = context_xmin + left_pad
    context_xmax = context_xmax + left_pad
    context_ymin = context_ymin + top_pad
    context_ymax = context_ymax + top_pad

    im_R = im[:, :, 0]
    im_G = im[:, :, 1]
    im_B = im[:, :, 2]

    # padding
    if (top_pad !=0) | (bottom_pad !=0) | (left_pad !=0) | (right_pad !=0):
        im_R = np.pad(im_R, ((int(top_pad), int(bottom_pad)), (int(left_pad), int(right_pad))), 'constant', constant_values = avg_chans[0])
        im_G = np.pad(im_G, ((int(top_pad), int(bottom_pad)), (int(left_pad), int(right_pad))), 'constant', constant_values = avg_chans[1])
        im_B = np.pad(im_B, ((int(top_pad), int(bottom_pad)), (int(left_pad), int(right_pad))), 'constant', constant_values = avg_chans[2])

        im = cat_img(im_R, im_G, im_B)

    im_patch_original = im[int(context_ymin)-1:int(context_ymax), int(context_xmin)-1:int(context_xmax), :]

    if model_sz != original_sz:
        im_patch = cv2.resize(im_patch_original, (int(model_sz), int(model_sz)), interpolation = cv2.INTER_CUBIC)
    else:
        im_patch = im_patch_original

    return im_patch


def make_scale_pyramid(im, target_position, in_side_scaled, out_side, avg_chans, p):
    """
    extract multi-scale image crops
    """
    in_side_scaled = np.round(in_side_scaled)
    pyramid = np.zeros((out_side, out_side, 3, p.num_scale), dtype = np.double)
    max_target_side = in_side_scaled[in_side_scaled.size-1]
    min_target_side = in_side_scaled[0]
    beta = out_side / min_target_side
    # size_in_search_area = beta * size_in_image
    # e.g. out_side = beta * min_target_side
    search_side = round(beta * max_target_side)

    search_region = get_subwindow_tracking(im, target_position, search_side, max_target_side, avg_chans)

    assert (round(beta * min_target_side) == out_side), "Error!"

    # extract multiple pyramid patches using get_subwindow_tracking() function;
    # you should use a loop to do this;
    # the number of scales is indicated by p.num_scale;
    # the scale information is stored in in_side_scaled[];
    # the obtained pyramid is represented in variable 'pyramid';
    
    # ------ WRITE YOUR OWN CODE (5 points) ------#
    # search_region was cropped around the target, so the target sits at its center
    center = 1 + search_side / 2
    region_center = np.array([center, center], dtype=np.float64)

    for i in range(p.num_scale):
        # side length of scale i, measured in search_region pixels
        scaled_side = round(in_side_scaled[i] * beta)
        pyramid[:, :, :, i] = get_subwindow_tracking(search_region, region_center, out_side, scaled_side, avg_chans)
    # ------------ END OF YOUR OWN CODE ------------#

    return pyramid


def tracker_eval(net, s_x, z_features, x_features, target_position, window, p):
    """
    do evaluation (i.e., a forward pass for search region)
    """
    # compute scores search regions of different scales
    scores = net.xcorr(z_features, x_features)
    scores = scores.to("cpu")

    response_maps = scores.squeeze().permute(1, 2, 0).data.numpy()
    # for this one, the opencv resize function works fine
    response_maps_up = cv2.resize(response_maps, (response_maps.shape[0]*p.response_UP, response_maps.shape[0]*p.response_UP), interpolation=cv2.INTER_CUBIC)

    # choose the scale whose response map has the highest peak
    if p.num_scale > 1:
        current_scale_id = p.num_scale // 2  # 0-based index of the unscaled (middle) entry of the pyramid
        best_scale = current_scale_id
        best_peak = float("-inf")
        for s in range(p.num_scale):
            this_response = response_maps_up[:, :, s]
            # penalize change of scale
            if s != current_scale_id:
                this_response = this_response * p.scale_penalty
            this_peak = np.max(this_response)
            if this_peak > best_peak:
                best_peak = this_peak
                best_scale = s
        response_map = response_maps_up[:, :, int(best_scale)]
    else:
        response_map = response_maps_up
        best_scale = 1
    # make the response map sum to 1
    response_map = response_map - np.min(response_map)
    response_map = response_map / sum(sum(response_map))

    # apply windowing
    response_map = (1 - p.w_influence) * response_map + p.w_influence * window
    p_corr = np.asarray(np.unravel_index(np.argmax(response_map), np.shape(response_map)))

    # avoid empty
    if p_corr[0] is None:
        p_corr[0] = np.ceil(p.score_size/2)
    if p_corr[1] is None:
        p_corr[1] = np.ceil(p.score_size/2)

    # Convert the crop-relative coordinates p_corr to frame coordinates;
    # the frame coordinates are represented using new_target_position (it is an array)

    # ------ WRITE YOUR OWN CODE (10 points) ------#
    # displacement of the peak from the center of the upsampled response map
    center = math.ceil(p.score_size * p.response_UP / 2)
    disp_f = p_corr - center
    # upsampled response map -> pixels in the search crop (instance)
    disp_in = disp_f * p.stride / p.response_UP
    # search crop -> pixels in the original frame
    disp_fr = disp_in * s_x / p.instance_size
    # shift the previous target position by the displacement
    new_target_position = target_position + disp_fr
    #------------ END OF YOUR OWN CODE -------------#

    return new_target_position, best_scale


if __name__ == "__main__":

    # get the default parameters
    p = Config()

    # load model
    net = torch.load('./SiamFCMModel.pth')
    net = net.to(device)

    # evaluation mode
    net.eval()

    # load sequence
    img_list, target_position, target_size = load_sequence('./SiamFCVideo')

    # first frame
    img_uint8 = cv2.imread(img_list[0])
    img_uint8 = cv2.cvtColor(img_uint8, cv2.COLOR_BGR2RGB)
    img_double = np.double(img_uint8)  # uint8 to float

    # compute avg for padding
    avg_chans = np.mean(img_double, axis=(0, 1))

    wc_z = target_size[1] + p.context_amount * sum(target_size)
    hc_z = target_size[0] + p.context_amount * sum(target_size)
    s_z = np.sqrt(wc_z * hc_z)
    scale_z = p.examplar_size / s_z

    # crop exemplar z in the first frame
    z_crop = get_subwindow_tracking(img_double, target_position, p.examplar_size, round(s_z), avg_chans)

    z_crop = np.uint8(z_crop)  # you need to convert it to uint8
    # convert image to tensor
    z_crop_tensor = 255.0 * F.to_tensor(z_crop).unsqueeze(0)

    d_search = (p.instance_size - p.examplar_size) / 2
    pad = d_search / scale_z
    s_x = s_z + 2 * pad
    # arbitrary scale saturation
    min_s_x = p.scale_min * s_x
    max_s_x = p.scale_max * s_x

    # generate cosine window
    if p.windowing == 'cosine':
        window = np.outer(np.hanning(p.score_size * p.response_UP), np.hanning(p.score_size * p.response_UP))
    elif p.windowing == 'uniform':
        window = np.ones((p.score_size * p.response_UP, p.score_size * p.response_UP))
    window = window / sum(sum(window))

    # pyramid scale search
    scales = p.scale_step ** np.linspace(-np.ceil(p.num_scale / 2), np.ceil(p.num_scale / 2), p.num_scale)

    # extract features for exemplar z
    z_features = net.feat_extraction(Variable(z_crop_tensor).to(device))
    z_features = z_features.repeat(p.num_scale, 1, 1, 1)

    # do tracking
    bboxes = np.zeros((len(img_list), 4), dtype=np.double)  # save tracking result
    for i in range(0, len(img_list)):
        print('processing frame %d ...' %(i+1))
        if i > 0:
            # do detection
            # currently, we only consider RGB images for tracking
            img_uint8 = cv2.imread(img_list[i])
            img_uint8 = cv2.cvtColor(img_uint8, cv2.COLOR_BGR2RGB)
            img_double = np.double(img_uint8)  # uint8 to float

            scaled_instance = s_x * scales
            scaled_target = np.zeros((2, scales.size), dtype=np.double)
            scaled_target[0, :] = target_size[0] * scales
            scaled_target[1, :] = target_size[1] * scales
            # extract scaled crops for search region x at previous target position
            x_crops = make_scale_pyramid(img_double, target_position, scaled_instance, p.instance_size, avg_chans, p)

            # get features of search regions
            x_crops_tensor = torch.FloatTensor(x_crops.shape[3], x_crops.shape[2], x_crops.shape[1], x_crops.shape[0])
            # response_map = SiameseNet.get_response_map(z_features, x_crops)
            for k in range(x_crops.shape[3]):
                tmp_x_crop = x_crops[:, :, :, k]
                tmp_x_crop = np.uint8(tmp_x_crop)
                # numpy array to tensor
                x_crops_tensor[k, :, :, :] = 255.0 * F.to_tensor(tmp_x_crop).unsqueeze(0)

            # get features of search regions;
            # the input to the network is x_crops_tensor;
            # the features of the search regions are stored in 'x_features'

            # ------- WRITE YOUR OWN CODE (5 points) -------#
            temp = Variable(x_crops_tensor)
            temp = temp.to(device)
            x_features = net.feat_extraction(temp)

            # ------------ END OF YOUR OWN CODE ------------ #

            # evaluate the offline-trained network on the search-region (x) features
            target_position, new_scale = tracker_eval(net, round(s_x), z_features, x_features, target_position, window,
                                                      p)

            # scale damping and saturation
            s_x = max(min_s_x, min(max_s_x, (1 - p.scale_LR) * s_x + p.scale_LR * scaled_instance[int(new_scale)]))
            target_size = (1 - p.scale_LR) * target_size + p.scale_LR * np.array(
                [scaled_target[0, int(new_scale)], scaled_target[1, int(new_scale)]])

        rect_position = np.array(
            [target_position[1] - target_size[1] / 2, target_position[0] - target_size[0] / 2, target_size[1],
             target_size[0]])

        # visualize the tracking results
        # comment for not visualizing
        # could be really slow on CoLab
        # if True:
        #     visualize_tracking_result(img_uint8, rect_position, 1)

        # output bbox in the original frame coordinates
        o_target_position = target_position
        o_target_size = target_size
        bboxes[i, :] = np.array(
            [o_target_position[1] - o_target_size[1] / 2, o_target_position[0] - o_target_size[0] / 2, o_target_size[1],
             o_target_size[0]])
    
    # save tracking results
    np.savetxt('./Tracking_Res.txt', bboxes, fmt='%.3f')





processing frame 1 ...
processing frame 2 ...
processing frame 3 ...
processing frame 4 ...
processing frame 5 ...
processing frame 6 ...
processing frame 7 ...
processing frame 8 ...
processing frame 9 ...
processing frame 10 ...
processing frame 11 ...
processing frame 12 ...
processing frame 13 ...
processing frame 14 ...
processing frame 15 ...
processing frame 16 ...
processing frame 17 ...
processing frame 18 ...
processing frame 19 ...
processing frame 20 ...
processing frame 21 ...
processing frame 22 ...
processing frame 23 ...
processing frame 24 ...
processing frame 25 ...
processing frame 26 ...
processing frame 27 ...
processing frame 28 ...
processing frame 29 ...
processing frame 30 ...
processing frame 31 ...
processing frame 32 ...
processing frame 33 ...
processing frame 34 ...
processing frame 35 ...
processing frame 36 ...
processing frame 37 ...
processing frame 38 ...
processing frame 39 ...
processing frame 40 ...
processing frame 41 ...
processing frame 42 ...
p

**After finishing the required parts and successfully running the code, please submit your code and the obtained tracking results in Tracking_Res.txt.**
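
Before submitting, you may want a rough sanity check of the saved boxes. The assignment does not prescribe an evaluation metric; the sketch below uses intersection-over-union (IoU), a common overlap measure for tracking, on boxes in the same (top_left_x, top_left_y, width, height) format that the code writes to Tracking_Res.txt. The helper names `bbox_iou` and `mean_iou` are made up for this example, and the boxes in the demo are toy values, not results from the sequence.

```python
import numpy as np


def bbox_iou(box_a, box_b):
    """IoU of two boxes given as (top_left_x, top_left_y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # width and height of the overlapping rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred_boxes, gt_boxes):
    """Average IoU over the frames that both arrays cover."""
    n = min(len(pred_boxes), len(gt_boxes))
    return float(np.mean([bbox_iou(pred_boxes[i], gt_boxes[i]) for i in range(n)]))


if __name__ == "__main__":
    # toy boxes: the first pair is identical, the second overlaps by half a box width
    pred = np.array([[10, 10, 20, 20], [12, 10, 20, 20]], dtype=float)
    gt = np.array([[10, 10, 20, 20], [22, 10, 20, 20]], dtype=float)
    print("mean IoU:", mean_iou(pred, gt))
```

To apply this to real results, the predictions can be read with `np.loadtxt('./Tracking_Res.txt')`. Comparing against `groundtruth_rect.txt` assumes that file holds one comma-separated box per frame, in the same order as the sorted image list; verify this for your data before trusting the number.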

## Submission guidelines


Please refer to previous guidelines for homework submission.

**Late submission penalty:** <br>
**NO LATE SUBMISSION is allowed for final.**
