This respository origins from a public implementation of Deep Image Retrieval. Although learned form a dataset of landmarks, the global representation it extracts from images is a robust solution for retrieving same-origin frames from a large data-pool of short videos.
To be specific, we measure the similarity of two images by calculating the cosine distance (after normalized) of the two embeddings extracted by the model of CNN and R-MAC. We find that a reasonable similarity threshold is helpful to the selection of the frames that look alike, because there is usually a gap of similarity scores between the same-origin frames and others. The method is especially useful when dealing with videos of something like daily events but not so robust with those of portraits and games, which might call for the fine-grained recognition.
query | top1 | top2 | top3 | top4 | top5 |
---|---|---|---|---|---|
similarity | 0.968 | 0.679 | 0.611 | 0.605 | 0.604 |
query | top1 | top2 | top3 | top4 | top5 |
---|---|---|---|---|---|
similarity | 0.630 | 0.619 | 0.617 | 0.605 | 0.596 |
We notice that the original implementation provides an option of multi-resolution (0.5x/1.0x/1.5x), and we find that it does improve the performance on Oxford, which is measured by mAP. However, there is two points that is not so desirable. Firstly, the multi-resolution calculatio is too redundant to afford. Secondly, the embeddings have a dimensionality of 2048, making it expensive for storage and further calculation. So we have the following attempts to improve.
It is a natural idea whether ultilizing the feature maps of different sizes extracted from the middle layers could improve the descriptor and take similar effects to the multi-resolution (something like SSD). Firstly we concat the features from middle layers directly with the one from the last layer and perform offline PCA (optionally) to get the descriptor, and find that it helps, although only slightly. The results below of offline PCA, which is measured by mAP on Oxford dataset, serves as our baseline and the code can be refered here. As the effects of offline PCA are not remarkable enough, and it might increase the dimensionality so as to obtain enhanced performances, which is not desirable.
model | resolution | PCA or not | number of feature maps | mAP |
---|---|---|---|---|
original | 512 | Y | 1 | 81.11 |
three-resolution | 512 | Y | 1 | 82.88 |
original | 512 | Y | 1 (middle) | 68.33 |
single-resolution | 512 | N | 1 | 76.75 |
single-resolution | 512 | N | 2 | 76.48 |
single-resolution | 512 | N | 4 | 76.61 |
single-resolution | 512 | Y | 2 | 81.62 |
single-resolution | 512 | Y | 4 | 81.78 |
model | resolution | PCA or not | number of feature maps | dim | mAP |
---|---|---|---|---|---|
original | 512 | Y | 1 | 2048 | 81.11 |
single-resolution | 512 | Y | 2 | 3072 | 81.62 |
single-resolution | 512 | Y | 2 | 2048 | 80.33 |
single-resolution | 512 | Y | 4 | 3840 | 81.78 |
single-resolution | 512 | Y | 4 | 2048 | 80.99 |
For further enhancemnet, we refer to knowledge distiiling to transfer the ability of 3-resoltion model to single-resolution. For this task, we have Paris dataset and Landmark dataset for training and have Oxford dataset and cover dataset for validation and test. We use a teacher model of 3-resoltion to extract the descriptors of images on the training set, and train a student model with loss of distance between the teacher descriptors and the student descriptors. Note that training in the way of knowledge distilling and the following metric learning will use the custom implementation of some auxiliary layers written in Python. Considering that the single resolution model has no access to the performance gain of multiple reception fields, so we test different architectures to fuse features from middle layers, and the results are as follow (measured by mAP on Oxford dataset of unified-resolution 512x and cover dataset). Only the original architecture trained by distilling is acceptable, and we guess that the other architectures is unable to maximize the performance due to a small training set (3w+) that is different from the one for training the original model.
model | mAP(uni-Oxford) | mAP(cover) | weights |
---|---|---|---|
original | 80.45 | 98.87 | 187MB |
three-resolution | 82.90 | 98.97 | 187MB |
original fine-tuned | 80.52 | 99.03 | 187MB |
single-resolution: pca-concat-pca | 79.45 | 98.82 | 217MB |
single-resolution: 1x1conv-eltwise-pca | 78.75 | 98.80 | 196MB |
single-resolution: concat-pca-relu | 74.18 | / | 196MB |
As the dimension of desriptors (2048 float) is too large for storage and later calculation, which we wish to cut it down to 512 or smaller. So we train the extractor again, but this time with the loss of metric learning. we use triplet loss firstly but find it hard to converge without hard negative mining and large enough GPU memory. As a result we turn to Lifted Structured Feature Embedding. You can find the configuration of trained model here. After metric-learning, we stack multi-resoltion and perform knowledge distiiling again to make further improvement. Finally we get a descriptor-extractor with reduced-dimensionality, although with slight loss of performance compared to the original model.
We validates the models mentioned above on the cover dataset by percision (number of same-origin ones) and recall (number of unrelevant ones) on 524 groups of images (5k) of cover dataset. Also, you can visualize the clusters of top-k retrieved frames here.
triplet512/threshold | 0.70 | 0.71 | 0.72 | 0.73 | 0.74 | 0.75 | 0.76 | 0.77 | 0.78 | 0.79 | 0.80 |
---|---|---|---|---|---|---|---|---|---|---|---|
same-origin | 1547 | 1530 | 1509 | 1484 | 1464 | 1432 | 1408 | 1383 | 1338 | 1294 | 1254 |
unrelevant | 3677 | 3658 | 3600 | 3478 | 3268 | 2942 | 2587 | 2177 | 1829 | 1515 | 1228 |
distilling/threshold | / | / | 0.56 | 0.57 | 0.58 | 0.59 | 0.60 | 0.61 | 0.62 | 0.63 | 0.64 |
---|---|---|---|---|---|---|---|---|---|---|---|
same-origin | ------- | ------- | 1504 | 1488 | 1467 | 1437 | 1417 | 1390 | 1369 | 1342 | 1298 |
unrelevant | ------- | ------- | 3127 | 2773 | 2436 | 2098 | 1802 | 1521 | 1316 | 1131 | 948 |
multires/threshold | / | / | 0.65 | 0.66 | 0.67 | 0.68 | 0.69 | 0.70 | 0.71 | 0.72 | 0.73 |
---|---|---|---|---|---|---|---|---|---|---|---|
same-origin | ------- | ------- | 1514 | 1493 | 1460 | 1429 | 1396 | 1367 | 1335 | 1313 | 1272 |
unrelevant | ------- | ------- | 3393 | 2951 | 2464 | 2016 | 1647 | 1361 | 1121 | 945 | 781 |
Actually we also try another way to reduce the dimensionality, that is, train the model with pair loss by calculating the distance of all the pairs within a batch, and it reaches similar effects as "triplet512" above, which is about a mAP of 78.5x on uni-oxford dataset.
The dataset used here includes Oxford dataset, Paris dataset, Landmark dataset and cover dataset.
The Python code is under the path 'myPython/' as follow:
├── myPython/
│ ├── custom_layers.py /* custom Python layers
│ ├── model_tools.py /* visualization and modification of Caffe weights
│ ├── region_generator.py /* RoI generator given the resolution
│ ├── check_on_cover.py /* test of precision and recall under different similarity threshold on cover dataset
│ ├── train.py /* train using Caffe PythonAPI
│ ├── dataset_helper/ /* operations on several dataset
│ │ ├── cover_helper.py /* cover
│ │ ├── landmark_helper.py /* Landmark
│ │ ├── paris_helper.py /* Paris
│ │ ├── oxford_helper.py /* Oxford
│ │ ├── download_helper.py /* download images by cmsid
│ ├── test/ /* evaluation
│ │ ├── test_on_oxford.py /* Oxford
│ │ ├── test_on_paris.py /* Paris
│ │ ├── test_on_landmark.py /* Landmark
│ │ ├── test_on_cover.py /* cover
│ │ ├── test_SimilarCluster.py /* visualization of the similar clusters
│ ├── convert/ /* convert the images into embeddings
│ ├── offline/ /* offline PCA experiments
The configuration of models and their weights are under the path 'proto/' and 'caffemodel/' as follow:
├── proto/
│ ├── offline/ /* offline PCA experiments
│ ├── distilling/ /* knowledge distilling(including the 3-resolution teacher model)
│ ├── triplet/ /* training with triplet-loss
│ ├── reduce/ /* training with pair loss (parallel for the RoI features,serial for the embeddings)
│ ├── deploy_resnet101.prototxt /* original model
│ ├── deploy_resnet101_normpython.prototxt /* original model with Python normalize_layer
The weights of the original model and the distilled one could be downloaded from the Google Drive.
This package contains the pretrained ResNet101 model and evaluation script for the method proposed in the following papers:
- Deep Image Retrieval: Learning global representations for image search. A. Gordo, J. Almazan, J. Revaud, and D. Larlus. In ECCV, 2016
- End-to-end Learning of Deep Visual Representations for Image Retrieval. A. Gordo, J. Almazan, J. Revaud, and D. Larlus. CoRR abs/1610.07940, 2016
- Caffe
- Region of Interest pooling layer (ROIPooling). This is the same layer used by fast RCNN and faster RCNN. A C++ implementation can be found in BVLC/caffe#4163
- L2-normalization layer (Normalize). Implemented in C++ in https://github.com/happynear/caffe-windows. As an alternative, we provide a python implementation of this layer that produces the same results, but is less efficient and does not implement backpropagation.
The evaluation script is prepared to work on the Oxford 5k and Paris 6k datasets. To set up the datasets:
mkdir datasets
cd datasets
Evaluation script:
mkdir evaluation
cd evaluation
wget http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/compute_ap.cpp
sed -i '6i#include <cstdlib>' compute_ap.cpp # Add cstdlib, as some compilers will produce an error otherwise
g++ -o compute_ap compute_ap.cpp
cd ..
Oxford:
mkdir -p Oxford
cd Oxford
mkdir jpg lab
wget http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/oxbuild_images.tgz
tar -xzf oxbuild_images.tgz -C jpg
wget http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/gt_files_170407.tgz
tar -xzf gt_files_170407.tgz -C lab
cd ..
Paris
mkdir -p Paris
cd Paris
mkdir jpg lab tmp
# Images are in a different folder structure, need to move them around
wget http://www.robots.ox.ac.uk/~vgg/data/parisbuildings/paris_1.tgz
wget http://www.robots.ox.ac.uk/~vgg/data/parisbuildings/paris_2.tgz
tar -xzf paris_1.tgz -C tmp
tar -xzf paris_2.tgz -C tmp
find tmp -type f -exec mv {} jpg/ \;
rm -rf tmp
wget http://www.robots.ox.ac.uk/~vgg/data/parisbuildings/paris_120310.tgz
tar -xzf paris_120310.tgz -C lab
cd ..
cd ..
$ python test.py
usage: test.py [-h] --gpu GPU --S S --L L --proto PROTO --weights WEIGHTS
--dataset DATASET --dataset_name DATASET_NAME --eval_binary
EVAL_BINARY --temp_dir TEMP_DIR [--multires] [--aqe AQE]
[--dbe DBE]
G: gpu id
S: size to resize the largest side of the images to. The model is trained with S=800, but different values may work better depending on the task.
L: number of levels of the rigid grid. Model was trained with L=2, but different levels (e.g. L=1 or L=3) may work better on other tasks.
PROTO: path to the prototxt. There are two prototxts included.
deploy_resnet101.prototxt relies on caffe being compiled with the normalization layer.
deploy_resnet101_normpython.prototxt does not have that requirement as it relies on the python implementation, but it may be slower as it is done on the cpu and does not implement backpropagation.
WEIGHTS: path to the caffemodel
DATASET: path to the dataset, for Oxford and Paris it is the directory that contains the jpg and lab folders.
DATASET_NAME: either Oxford or Paris
EVAL_BINARY: path to the compute_ap binary provided with Oxford and Paris used to compute the ap scores
TEMP_DIR: a temporary directory to store features and scores
Note that this model does not implement the region proposal network.
Adjust paths as necessary:
Rigid grid, no multiresolution, no query expansion or database side feature augmentation:
python test.py --gpu 0 --S 800 --L 2 --proto deploy_resnet101_normpython.prototxt --weights model.caffemodel --dataset datasets/Oxford --eval_binary datasets/evaluation/compute_ap --temp_dir tmp --dataset_name Oxford
Expected accuracy: 84.09
python test.py --gpu 0 --S 800 --L 2 --proto deploy_resnet101_normpython.prototxt --weights model.caffemodel --dataset datasets/Paris --eval_binary datasets/evaluation/compute_ap --temp_dir tmp --dataset_name Paris
Expected accuracy: 93.57
Rigid grid, multiresolution, no query expansion or database side feature augmentation:
python test.py --gpu 0 --S 800 --L 2 --proto deploy_resnet101_normpython.prototxt --weights model.caffemodel --dataset datasets/Oxford --eval_binary datasets/evaluation/compute_ap --temp_dir tmp --dataset_name Oxford --multires
Expected accuracy: 86.07
python test.py --gpu 0 --S 800 --L 2 --proto deploy_resnet101_normpython.prototxt --weights model.caffemodel --dataset datasets/Paris --eval_binary datasets/evaluation/compute_ap --temp_dir tmp --dataset_name Paris --multires
Expected accuracy: 94.53
Rigid grid, multiresolution, query expansion (k=1) and database side feature augmentation (k=20):
python test.py --gpu 0 --S 800 --L 2 --proto deploy_resnet101_normpython.prototxt --weights model.caffemodel --dataset datasets/Oxford --eval_binary datasets/evaluation/compute_ap --temp_dir tmp --dataset_name Oxford –multires --aqe 1 --dbe 20
Expected accuracy: 94.68
python test.py --gpu 0 --S 800 --L 2 --proto deploy_resnet101_normpython.prototxt --weights model.caffemodel --dataset datasets/Paris --eval_binary datasets/evaluation/compute_ap --temp_dir tmp --dataset_name Paris –multires --aqe 1 --dbe 20
Expected accuracy: 96.58
If you use these models in your research, please cite:
@inproceedings{Gordo2016a,
title={Deep Image Retrieval: Learning global representations for image search},
author={Albert Gordo and Jon Almazan and Jerome Revaud and Diane Larlus},
booktitke={ECCV},
year={2016}
}
@article{Gordo2016b,
title={End-to-end Learning of Deep Visual Representations for Image Retrieval}
author={Albert Gordo and Jon Almazan and Jerome Revaud and Diane Larlus},
journal={CoRR abs/1610.07940},
year={2016}
}
Please see LICENSE.txt
for the license information.