# Notes on Template Matching By CNN

## Template Matching Survey

### Area-Based Approach
Compute the correlation of pixel intensities between the template and each candidate window, e.g., with normalized cross-correlation (NCC).
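
As a minimal sketch of the area-based approach, the pure-Python code below computes zero-mean NCC between a template and every window of a search image and keeps the best location. The function names (`ncc`, `match_template`) and the list-of-rows image representation are assumptions made for illustration; they do not come from any library.

```python
import math

def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # flat window: correlation is undefined, treat as no match
    return sum(x * y for x, y in zip(da, db)) / denom

def match_template(image, template):
    """Slide the template over the image; return (best_score, row, col)."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    flat_t = [v for row in template for v in row]
    best = (-2.0, -1, -1)  # NCC is never below -1, so any window beats this
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            window = [image[r + i][c + j] for i in range(th) for j in range(tw)]
            score = ncc(window, flat_t)
            if score > best[0]:
                best = (score, r, c)
    return best
```

Because the means are subtracted and the result is normalized, the score does not change under a linear brightness or contrast change of either patch.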

### Feature-Based Approach
Generate a feature map first, e.g., the gradient. Extract edges and match only the pixels near them instead of the whole pattern. Generally, it is the gradient direction of the edge pixels, rather than their intensity, that is matched.
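
A rough sketch of matching by gradient direction follows. It uses central differences for the gradient and the mean cosine of the angle between template and image gradient directions as the score. Both choices, the `min_mag` threshold, and all function names are assumptions for illustration, not the method of any particular paper.

```python
import math

def gradients(img):
    """Central-difference gradients (gx, gy); border pixels are left at 0."""
    h, w = len(img), len(img[0])
    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx[r][c] = (img[r][c + 1] - img[r][c - 1]) / 2.0
            gy[r][c] = (img[r + 1][c] - img[r - 1][c]) / 2.0
    return gx, gy

def edge_model(template, min_mag=1.0):
    """Keep only template pixels with a strong gradient, as unit direction vectors."""
    gx, gy = gradients(template)
    model = []
    for r in range(len(template)):
        for c in range(len(template[0])):
            mag = math.hypot(gx[r][c], gy[r][c])
            if mag >= min_mag:
                model.append((r, c, gx[r][c] / mag, gy[r][c] / mag))
    return model

def direction_score(model, gx, gy, top, left):
    """Mean cosine between model directions and image gradient directions."""
    total = 0.0
    for r, c, ux, uy in model:
        ix, iy = gx[top + r][left + c], gy[top + r][left + c]
        mag = math.hypot(ix, iy)
        if mag > 0.0:
            total += (ux * ix + uy * iy) / mag
    return total / len(model) if model else 0.0
```

Only the direction of each gradient enters the score, so multiplying the image intensities by a positive constant leaves it unchanged.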

To achieve rotation and scale invariance, multi-angle and multi-scale matching are performed.
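
One way to picture the multi-angle, multi-scale search is as a grid of pose hypotheses, each applied to the template model before matching. The sketch below assumes the model is a list of `(x, y, dx, dy)` points with unit direction vectors; the angle step and scale values are arbitrary examples, and the function names are hypothetical.

```python
import math

def transform_model(points, angle_deg, scale):
    """Rotate and scale (x, y, dx, dy) model points about the origin.

    Positions are rotated and scaled; unit direction vectors are only rotated.
    """
    t = math.radians(angle_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    out = []
    for x, y, dx, dy in points:
        out.append((
            scale * (cos_t * x - sin_t * y),
            scale * (sin_t * x + cos_t * y),
            cos_t * dx - sin_t * dy,
            sin_t * dx + cos_t * dy,
        ))
    return out

def pose_grid(angle_step=10.0, scales=(0.8, 1.0, 1.25)):
    """Enumerate the (angle, scale) hypotheses to try."""
    n = int(round(360.0 / angle_step))
    return [(i * angle_step, s) for s in scales for i in range(n)]
```

The cost grows with the number of hypotheses (108 for the defaults above), since each pose requires its own pass over the search image.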

## Try MatchNet

Use the feature tower of MatchNet with the weights trained by zixin kept fixed, but train our own bottleneck and metric. The bottleneck first reduces the feature map to 512x1x1 with a **fully convolutional layer**, which functions the same as an fc layer. The metric then also uses a **fully convolutional layer**, with **kernel size equal to 1**, and produces the final softmax estimate of match confidence, of size 2x1x1.
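
To illustrate why such a convolution behaves like an fc layer, here is a small pure-Python sketch: a kernel that spans the whole C x H x W input produces a single value per kernel, which is the same dot product an fc layer computes on the flattened input. The tiny sizes and the function names are assumptions for illustration; the real layers operate on the tower output and produce 512 and 2 channels.

```python
import math

def full_size_conv(feature_map, kernels, biases):
    """Convolution whose kernel spans the whole C x H x W input (no padding).

    The output is K x 1 x 1: one dot product per kernel.
    """
    out = []
    for kernel, bias in zip(kernels, biases):
        acc = bias
        for c, channel in enumerate(feature_map):
            for y, row in enumerate(channel):
                for x, value in enumerate(row):
                    acc += kernel[c][y][x] * value
        out.append(acc)
    return out

def conv1x1(vector, weights, biases):
    """1x1 convolution on a C x 1 x 1 input: a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Keeping these layers convolutional means the same weights can later be slid over a larger feature map, producing one confidence per window instead of a single value.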

We train and test on the data where positive patch pairs are the same and negative pairs are the same. The accuracy reaches 99.9% because both the positive and negative samples are very simple.

Multi-view patch pairs with rotations are also trained and tested. The accuracy is also good, but the result is no longer discriminative. Rotation invariance can be achieved.

### Add small translation offsets
To incorporate some variance for better localization, I add small translation offsets varying from 0 to 4. In training, first fix the feature tower and train only the bottleneck layer and the metric.
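
A possible way to generate such pairs is sketched below. It assumes the offsets are whole pixels drawn uniformly from 0 to 4 and that images are lists of rows; `crop` and `make_shifted_pair` are hypothetical helpers, not part of MatchNet.

```python
import random

def crop(image, top, left, height, width):
    """Return a height x width sub-image (list of rows) starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

def make_shifted_pair(image, top, left, size, max_offset=4, rng=random):
    """Build a positive pair: a reference patch and a copy shifted by 0..max_offset.

    The caller must ensure the shifted window still lies inside the image.
    """
    dx = rng.randint(0, max_offset)
    dy = rng.randint(0, max_offset)
    reference = crop(image, top, left, size, size)
    shifted = crop(image, top + dy, left + dx, size, size)
    return reference, shifted, (dx, dy)
```

Passing a seeded `random.Random` instance as `rng` makes the generated pairs reproducible.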

## MatchNet+Regression

Then we try adding a sibling regression network to estimate the transformation parameters between patch pairs while maintaining the discrimination of the metric.

The training strategy generally follows:
* Fix the feature towers' weights and train the bottleneck layer and metric on positive and negative patch pairs. The positive patch pairs undergo handmade transformations between them.
* Fix the bottleneck and metric. Train the regression branch.
* Relax the bottleneck layer and the sibling metric and regression branches, and train all of them with adjusted loss weights.
* Finally, relax the feature towers if necessary.
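
The four steps above can be written down as a schedule of which modules are trainable in each stage. This is only a bookkeeping sketch: the module names are labels chosen here, and the loss weights in `joint_loss` are placeholders, since the notes do not give the actual values.

```python
MODULES = ("feature_tower", "bottleneck", "metric", "regression")

# Each stage lists the modules whose weights are updated; the rest stay frozen.
STAGES = [
    {"bottleneck", "metric"},                                 # step 1
    {"regression"},                                           # step 2
    {"bottleneck", "metric", "regression"},                   # step 3
    {"feature_tower", "bottleneck", "metric", "regression"},  # step 4 (optional)
]

def frozen_modules(stage_index):
    """Modules kept fixed during the given stage."""
    return [m for m in MODULES if m not in STAGES[stage_index]]

def joint_loss(softmax_loss, regression_loss, w_metric=1.0, w_regression=1.0):
    """Weighted sum used when both sibling branches train together."""
    return w_metric * softmax_loss + w_regression * regression_loss
```

For example, `frozen_modules(0)` returns the feature tower and the regression branch, matching the first step.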

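The weights are fixed by adding the option below to the layers on top of them, i.e., the layers that consume their outputs, which stops gradients from propagating down. Repeat the line m times if the layer has m inputs.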

    propagate_down: false

### Aspect Ratio+Scale
The first transformation case is aspect ratio and scale (therefore, dof = 2). The scale ratios in the x and y dimensions vary from 1 to 2, so the aspect ratio varies from 1/2 to 2. Since the scale ratio is no less than 1, the transformed patches are smaller. The regression results are quite good.
![regression with aspect ratio and scale](images/regression_aspect_ratio_and_scale.jpg)
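
The ranges above can be checked with a short sampling sketch. It assumes the two scale ratios are drawn independently and uniformly, and that a ratio `s` shrinks the corresponding side to `size / s`; the notes do not state the sampling distribution, and the function names are hypothetical.

```python
import random

def sample_scale_pair(rng=random, low=1.0, high=2.0):
    """Draw independent scale ratios for x and y (2 degrees of freedom)."""
    return rng.uniform(low, high), rng.uniform(low, high)

def transformed_size(width, height, sx, sy):
    """Size of the transformed patch; ratios >= 1 shrink it (assumed: size / ratio)."""
    return width / sx, height / sy

def aspect_ratio(sx, sy):
    """Ratio of the x scale to the y scale; lies in [1/2, 2] for ratios in [1, 2]."""
    return sx / sy
```

The pair `(sx, sy)` is then the 2-vector the regression branch is asked to predict.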

### Aspect Ratio+Scale+Translation
Besides the variation in aspect ratio and scale, translation offsets are added (therefore, dof = 4). The offsets in the x and y dimensions vary from zero to the width or height of the smaller transformed patch.
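
A sketch of sampling the four parameters under these ranges is given below. Uniform sampling, the `size / ratio` shrinkage, and the `Transform` container are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class Transform:
    sx: float  # scale ratio in x, 1..2
    sy: float  # scale ratio in y, 1..2
    tx: float  # x offset, 0..width of the transformed patch
    ty: float  # y offset, 0..height of the transformed patch

def sample_transform(width, height, rng=random):
    """Draw one 4-dof transformation for a width x height patch."""
    sx = rng.uniform(1.0, 2.0)
    sy = rng.uniform(1.0, 2.0)
    small_w, small_h = width / sx, height / sy  # assumed: size shrinks by the ratio
    tx = rng.uniform(0.0, small_w)
    ty = rng.uniform(0.0, small_h)
    return Transform(sx, sy, tx, ty)

def regression_target(t):
    """The 4-vector a regression branch would be asked to predict."""
    return [t.sx, t.sy, t.tx, t.ty]
```

Note that the offset range depends on the sampled scale, so the four targets are not independent of each other.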

Report on training:
* Fix the feature towers, bottleneck layer and regression branch, and train only the metric branch. The softmax loss reaches 0.12.
* Fix the feature towers and regression branch, and train the bottleneck layer and metric branch. The softmax loss reaches 0.06 and the accuracy reaches 97%.
<span style="color:red">(Although the accuracy on pairwise patches is high, the localization accuracy is very low. The confidences of positive subpatches are very low. I have not figured out the reason.)</span>

## Template Matching

The goal of template matching is to locate patches in the search image whose pattern is similar to the template. For high-quality matching, a number of variations such as scale change, skew, and subtle translation must be overcome. As a first step, we begin by locating exactly the same patch extracted from the search image, and then add small variations to it.

Originally, we held that the key to accurate localization was to optimize the MatchNet model, because it can provide a discriminative similarity metric between the query template and the sub-patches in the search images. However, after passing through the feature tower, the resulting feature map of the target patch is slightly different from that of the template patch, because some intensity values from neighboring pixels outside the target patch are absorbed. Therefore, training MatchNet does not directly optimize the similarity metric between the template and the target patch in the search image. This is why bias is present when the trained MatchNet is applied to the template matching task, even though MatchNet achieves good quality in matching patch pairs.

To directly optimize the network for template matching, we still follow the MatchNet architecture, but train from the feature map of the template and that of the target patch taken from search images that have passed through the feature tower. The gap is indeed bridged. First, we trained on data in which the templates have no offsets with respect to the grid of the search images. When only the bottleneck and the metric network are optimized, the softmax loss drops to **0.06**. Then the weights of the feature towers are relaxed and the loss further decreases to **0.024**.
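
The localization loop this implies can be sketched as follows: the search image goes through the feature tower once, and every window of its feature map is scored against the template's feature map. Here `score_fn` is a stand-in for the trained bottleneck and metric, feature maps are nested lists of shape C x H x W, and both function names are hypothetical.

```python
def crop_features(feature_map, top, left, height, width):
    """Crop a C x height x width window from a C x H x W feature map."""
    return [[row[left:left + width] for row in channel[top:top + height]]
            for channel in feature_map]

def locate(template_feat, search_feat, score_fn):
    """Score every feature-map window against the template features.

    score_fn(template_feat, window) should return a match confidence.
    Returns (best_score, top, left) in feature-map coordinates.
    """
    th, tw = len(template_feat[0]), len(template_feat[0][0])
    sh, sw = len(search_feat[0]), len(search_feat[0][0])
    best = None
    for top in range(sh - th + 1):
        for left in range(sw - tw + 1):
            window = crop_features(search_feat, top, left, th, tw)
            score = score_fn(template_feat, window)
            if best is None or score > best[0]:
                best = (score, top, left)
    return best
```

Because each window is cropped from the full-image feature map, it carries the same context from neighboring pixels at training time and at test time, which is the mismatch the paragraph above describes.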

### Translation Invariance

When the query patch does not lie exactly on the grid of the search images, localization with the network learnt above degrades to some extent. Therefore, translation variations (-3 to 3 pixels) are introduced into the training data to boost localization performance.
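
The sketch below shows one way to think about the off-grid case and the jitter. `nearest_grid_cell` splits a pixel coordinate into a grid index and the leftover offset for a given stride, and `jitter` perturbs a grid-aligned position by up to 3 pixels. The stride is a hypothetical parameter (the notes do not state the grid spacing), and both function names are made up for illustration.

```python
import random

def nearest_grid_cell(position, stride):
    """Split a pixel coordinate into the nearest grid index and the residual offset."""
    cell = int(round(position / stride))
    return cell, position - cell * stride

def jitter(top, left, max_shift=3, rng=random):
    """Shift a grid-aligned patch position by -max_shift..max_shift pixels per axis."""
    return (top + rng.randint(-max_shift, max_shift),
            left + rng.randint(-max_shift, max_shift))
```

Training on jittered positions exposes the metric to the residual offsets it will meet when the true match falls between grid cells.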

## Image Matching Framework

1. **Detection** Using the saliency map, give each patch box a saliency score.
2. **Template Matching**
3. **Feature Matching Across Patches**
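
For the detection step, a box score could be as simple as the mean saliency inside the box. The sketch below assumes the saliency map is a list of rows and boxes are `(top, left, height, width)` tuples; averaging rather than summing is an assumption, and the function names are hypothetical.

```python
def box_saliency(saliency_map, top, left, height, width):
    """Mean saliency inside a box; returns 0.0 for an empty box."""
    values = [v for row in saliency_map[top:top + height]
              for v in row[left:left + width]]
    return sum(values) / len(values) if values else 0.0

def rank_boxes(saliency_map, boxes):
    """Sort (top, left, height, width) boxes from most to least salient."""
    return sorted(boxes, key=lambda b: box_saliency(saliency_map, *b), reverse=True)
```

Using the mean keeps scores comparable across boxes of different sizes, whereas a sum would favor larger boxes.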

### Some Reflections

Essentially, the matching score of two images passed through a CNN is computed by a function $$Score = f(I_1, I_2, w),$$ where $f(\cdot)$ is highly nonlinear. For two matched images, the overlapping area is quite correlated, and it is this correlation that contributes to the final similarity score, e.g., $f(i_1, i_2)$, where $i_1$ and $i_2$ represent local patches. The objective of overlap detection or matching is therefore similar to discovering the most correlated pixels or local patches.


[1] Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps