- Main idea: A deep network, termed AncorNet produces image representations that are well-suited for semantic matching.
- Tasks:
- Cross-instance matching: different instances of the same object category are matched.
- Cross-category matching: align pairs of instances each from a different object class.
- Introduce a new deep architecture that:
- learn representations that work well for cross-instance matching.
- using only image-level labels.
- contains a set of sparse and diverse filters.
-
Hypercolumns: A hypercolumn d(u,v) at location (u,v) in the image is created by concatenating the convolutional feature responses at that location for different layers of the network.
-
Then train 3x3 convolutional filters F1, F2,..., FK that compress the hypercolumns into a compact set of anchor filters that are suitable for matching. These filters are trained with two constraints:
-
Discriminability constraints. Discriminability is encouraged by minimizing the following loss function:
This constraint basically means the response of the filters should be discrinative to do classification. A secondary loss is introduced only for negative samples to speed up training:
-
Diversity constraints. Two losses are used to maximizing the "coverage" of these filters, one loss is used to make all convolutional filters orthogonal to each other:
Another loss is used to directly decorrelates the filters' response maps:
-
-
Class-agnostic matching is achevied by introducing a new bank of 1x1 filters that projects the class-specific responses of the filters F1, F2, ..., FK to L general-purpose response maps. To learn L class-agnostic filters, they use a denoting auto encoder and minimize:
-
The complete architecture is: