# End-to-End Object Detection with Transformers 

(https://arxiv.org/abs/2005.12872)

## Output positional encodings (object queries) 

"Output positional encodings are required and cannot be removed, so we experiment with either passing them once at decoder input or adding to queries at every decoder attention layer."

<img src="arXiv_2005_12872/fig2.png">

<img src="arXiv_2005_12872/fig3.png">

* I think the purpose of the output positional encodings is to introduce some kind of spatial masking, or mutual exclusivity, between the N attention maps, each of which is supposed to cover a different object (see the attention maps in Fig. 8).

<img src="arXiv_2005_12872/fig8.png">

That would be why the output positional encodings (object queries) need to be learned.
If so, the N object queries are trying to be mutually exclusive (to do: plot them and verify).
We could perhaps try fixed sine/cosine positional encodings whose frequency is a function of both i (the encoding dimension) and n (the n-th object query out of N), chosen so that the object queries are orthogonal to each other. That way the model would learn to map different objects (and non-objects) onto one of the N orthogonal basis vectors.
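
A minimal sketch for checking this idea, under stated assumptions: it builds fixed queries with the standard 1D sinusoidal scheme (the base of 10000 and the sin/cos interleaving are assumptions, and the names `sinusoidal_queries` and `max_off_diagonal_similarity` are made up for illustration). With this scheme the dot product between queries n and m is a sum of cos((n - m) * freq) terms, so the queries are not exactly orthogonal; the helper only measures how far from orthogonal they are.

```python
import math


def sinusoidal_queries(num_queries, dim, base=10000.0):
    """Return num_queries fixed vectors of length dim (dim must be even)."""
    assert dim % 2 == 0
    queries = []
    for n in range(num_queries):
        vec = []
        for i in range(dim // 2):
            # One frequency per (sin, cos) pair, as in the 1D transformer encoding.
            freq = 1.0 / (base ** (2 * i / dim))
            vec.append(math.sin(n * freq))
            vec.append(math.cos(n * freq))
        queries.append(vec)
    return queries


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def max_off_diagonal_similarity(queries):
    """Largest |cosine similarity| between two distinct queries (0 means orthogonal)."""
    worst = 0.0
    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            worst = max(worst, abs(cosine_similarity(queries[i], queries[j])))
    return worst


queries = sinusoidal_queries(num_queries=8, dim=16)
print(max_off_diagonal_similarity(queries))
```

The same helper could be pointed at the learned object queries of a trained model to see whether they end up closer to orthogonal than this fixed scheme.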

# Positional Encoding

Looking at:
- https://arxiv.org/pdf/1802.05751.pdf (for 2D version used in the paper)
- https://arxiv.org/pdf/1904.07418.pdf
- https://nlp.seas.harvard.edu/2018/04/03/attention.html#positional-encoding
- http://jalammar.github.io/illustrated-transformer/
- https://arxiv.org/pdf/1706.03762.pdf

From the DETR paper (https://arxiv.org/abs/2005.12872):

"Specifically, for both spatial coordinates of each embedding we independently used d/2 sine and cosine functions with different frequencies. We then concatenate them to get the final d channel positional encoding."

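The quoted recipe can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the base of 10000, the sin/cos interleaving, and the use of raw integer coordinates are my choices, and the function names are hypothetical. Each spatial coordinate gets d/2 channels, and the two halves are concatenated into a d-channel encoding.

```python
import math


def encode_1d(pos, channels, base=10000.0):
    """Sinusoidal encoding of a single coordinate into `channels` values."""
    assert channels % 2 == 0
    out = []
    for i in range(channels // 2):
        freq = 1.0 / (base ** (2 * i / channels))
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out


def positional_encoding_2d(height, width, d):
    """Return a height x width grid of d-channel encodings (nested lists)."""
    assert d % 4 == 0  # d/2 channels per coordinate, each half split into sin/cos pairs
    half = d // 2
    return [
        # First half encodes the row (y), second half encodes the column (x).
        [encode_1d(y, half) + encode_1d(x, half) for x in range(width)]
        for y in range(height)
    ]


pe = positional_encoding_2d(height=3, width=4, d=8)
print(len(pe), len(pe[0]), len(pe[0][0]))  # grid height, grid width, channels
```

Positions in the same row share the first d/2 channels, and positions in the same column share the last d/2 channels.
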
# R-CNN (“Region-based CNN”) family



| Model | Goal | Resources |
|-------|------|-----------|
| R-CNN | Object recognition | [[paper](https://arxiv.org/abs/1311.2524)] [[code](https://github.com/rbgirshick/rcnn)]|
| Fast R-CNN | Object recognition | [[paper](https://arxiv.org/abs/1504.08083)] [[code](https://github.com/rbgirshick/fast-rcnn)]|
| Faster R-CNN | Object recognition | [[paper](https://arxiv.org/abs/1506.01497)] [[code](https://github.com/rbgirshick/py-faster-rcnn)]|
| Mask R-CNN | Instance segmentation | [[paper](https://arxiv.org/abs/1703.06870)] [[code](https://github.com/CharlesShang/FastMaskRCNN)]|

Reading up: https://lilianweng.github.io/lil-log/2017/12/31/object-recognition-for-dummies-part-3.html

# R-CNN

- Select ~2000 region candidates (RoIs) using selective search (to read: https://lilianweng.github.io/lil-log/2017/10/29/object-recognition-for-dummies-part-1.html#selective-search).
    * Duplicate boxes for the same object are removed using non-maximum suppression (in the paper this is applied per class, to the scored boxes).
- Warp each box to a fixed input size and run it through a ConvNet.
- The ConvNet features are classified into object classes using SVMs, one binary SVM per class (question: is an SVM not a specific (binary) case of softmax?).
- At the end, a regression step corrects the offset between the predicted boxes and the ground-truth boxes; it is applied when a predicted box has a decent overlap with a ground-truth box.
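
The overlap measure and the non-maximum suppression step mentioned above can be sketched like this. It is a plain greedy NMS for one class, assuming boxes in `(x1, y1, x2, y2)` format; the 0.5 threshold and the function names are illustrative choices, not values taken from these notes.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first heavily and is dropped
```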

# Fast R-CNN

- To avoid running the ConvNet on each box separately, the ConvNet is first run once on the entire image.
- Object regions (RoIs) are then picked out on the ConvNet output (activation map); the proposals themselves still come from selective search.
    * this is now the bottleneck step
- Each RoI is then reduced to a fixed size using RoI pooling and fed to fully connected layers to produce a feature vector.
- The feature vector is classified using softmax into K+1 object classes (the extra one for background), and a box regressor adjusts the predicted box to match the ground-truth box.
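
A rough sketch of the RoI pooling idea, under simplifying assumptions: a single-channel feature map stored as a nested list, and an RoI already expressed in integer feature-map cells as `(x1, y1, x2, y2)` with the upper bounds exclusive. The function name `roi_max_pool` and the exact binning rule are illustrative; real implementations handle channels, batches, and the scaling from image to feature-map coordinates.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the cells inside `roi` into a fixed out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    assert roi_h >= 1 and roi_w >= 1
    pooled = []
    for by in range(out_h):
        # Bin boundaries: floor for the start, ceil for the end, so no bin is empty.
        ys = y1 + (by * roi_h) // out_h
        ye = y1 + -(-((by + 1) * roi_h) // out_h)
        row = []
        for bx in range(out_w):
            xs = x1 + (bx * roi_w) // out_w
            xe = x1 + -(-((bx + 1) * roi_w) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# 4x4 feature map with values 0..15 in row-major order.
fmap = [[4 * r + c for c in range(4)] for r in range(4)]
print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))
```

Whatever the size of the RoI, the output has the same fixed shape, which is what lets the fully connected layers that follow take RoIs of any size.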

# Faster R-CNN

To avoid the selective search step, which is the bottleneck, Faster R-CNN trains a network (the Region Proposal Network, RPN) to propose object boxes in its place.

The rest is the same as Fast R-CNN.
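
The proposal network in the Faster R-CNN paper scores a fixed set of reference boxes ("anchors") at every feature-map location, by default 3 scales x 3 aspect ratios. The sketch below only generates those 9 boxes around one centre point; the function name and the ratio convention (height / width) are assumptions for illustration, and the scoring and regression heads are not shown.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x1, y1, x2, y2) centred at (cx, cy).

    Each anchor has area scale**2 and aspect ratio height / width = ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


for box in make_anchors(cx=100, cy=100):
    print(tuple(round(v, 1) for v in box))
```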

# Mask R-CNN

Also does segmentation inside the object boxes: it predicts a pixel-level mask for each detected object.

<img src = "https://lilianweng.github.io/lil-log/assets/images/rcnn-family-summary.png">

# YOLO
