my Deep Retrieval

This respository origins from a public implementation of Deep Image Retrieval. Although learned form a dataset of landmarks, the global representation it extracts from images is a robust solution for retrieving same-origin frames from a large data-pool of short videos.

To be specific, we measure the similarity of two images by calculating the cosine distance (after normalized) of the two embeddings extracted by the model of CNN and R-MAC. We find that a reasonable similarity threshold is helpful to the selection of the frames that look alike, because there is usually a gap of similarity scores between the same-origin frames and others. The method is especially useful when dealing with videos of something like daily events but not so robust with those of portraits and games, which might call for the fine-grained recognition.

query	top1	top2	top3	top4	top5

similarity	0.968	0.679	0.611	0.605	0.604

query	top1	top2	top3	top4	top5

similarity	0.630	0.619	0.617	0.605	0.596

We notice that the original implementation provides an option of multi-resolution (0.5x/1.0x/1.5x), and we find that it does improve the performance on Oxford, which is measured by mAP. However, there is two points that is not so desirable. Firstly, the multi-resolution calculatio is too redundant to afford. Secondly, the embeddings have a dimensionality of 2048, making it expensive for storage and further calculation. So we have the following attempts to improve.

Offline PCA

It is a natural idea whether ultilizing the feature maps of different sizes extracted from the middle layers could improve the descriptor and take similar effects to the multi-resolution (something like SSD). Firstly we concat the features from middle layers directly with the one from the last layer and perform offline PCA (optionally) to get the descriptor, and find that it helps, although only slightly. The results below of offline PCA, which is measured by mAP on Oxford dataset, serves as our baseline and the code can be refered here. As the effects of offline PCA are not remarkable enough, and it might increase the dimensionality so as to obtain enhanced performances, which is not desirable.

model	resolution	PCA or not	number of feature maps	mAP
original	512	Y	1	81.11
three-resolution	512	Y	1	82.88
original	512	Y	1 (middle)	68.33
single-resolution	512	N	1	76.75
single-resolution	512	N	2	76.48
single-resolution	512	N	4	76.61
single-resolution	512	Y	2	81.62
single-resolution	512	Y	4	81.78

model	resolution	PCA or not	number of feature maps	dim	mAP
original	512	Y	1	2048	81.11
single-resolution	512	Y	2	3072	81.62
single-resolution	512	Y	2	2048	80.33
single-resolution	512	Y	4	3840	81.78
single-resolution	512	Y	4	2048	80.99

Knowledge Distiiling

For further enhancemnet, we refer to knowledge distiiling to transfer the ability of 3-resoltion model to single-resolution. For this task, we have Paris dataset and Landmark dataset for training and have Oxford dataset and cover dataset for validation and test. We use a teacher model of 3-resoltion to extract the descriptors of images on the training set, and train a student model with loss of distance between the teacher descriptors and the student descriptors. Note that training in the way of knowledge distilling and the following metric learning will use the custom implementation of some auxiliary layers written in Python. Considering that the single resolution model has no access to the performance gain of multiple reception fields, so we test different architectures to fuse features from middle layers, and the results are as follow (measured by mAP on Oxford dataset of unified-resolution 512x and cover dataset). Only the original architecture trained by distilling is acceptable, and we guess that the other architectures is unable to maximize the performance due to a small training set (3w+) that is different from the one for training the original model.

model	mAP(uni-Oxford)	mAP(cover)	weights
original	80.45	98.87	187MB
three-resolution	82.90	98.97	187MB
original fine-tuned	80.52	99.03	187MB
single-resolution: pca-concat-pca	79.45	98.82	217MB
single-resolution: 1x1conv-eltwise-pca	78.75	98.80	196MB
single-resolution: concat-pca-relu	74.18	/	196MB

Metric Learning

As the dimension of desriptors (2048 float) is too large for storage and later calculation, which we wish to cut it down to 512 or smaller. So we train the extractor again, but this time with the loss of metric learning. we use triplet loss firstly but find it hard to converge without hard negative mining and large enough GPU memory. As a result we turn to Lifted Structured Feature Embedding. You can find the configuration of trained model here. After metric-learning, we stack multi-resoltion and perform knowledge distiiling again to make further improvement. Finally we get a descriptor-extractor with reduced-dimensionality, although with slight loss of performance compared to the original model.

We validates the models mentioned above on the cover dataset by percision (number of same-origin ones) and recall (number of unrelevant ones) on 524 groups of images (5k) of cover dataset. Also, you can visualize the clusters of top-k retrieved frames here.

triplet512/threshold	0.70	0.71	0.72	0.73	0.74	0.75	0.76	0.77	0.78	0.79	0.80
same-origin	1547	1530	1509	1484	1464	1432	1408	1383	1338	1294	1254
unrelevant	3677	3658	3600	3478	3268	2942	2587	2177	1829	1515	1228

distilling/threshold	/	/	0.56	0.57	0.58	0.59	0.60	0.61	0.62	0.63	0.64
same-origin	-------	-------	1504	1488	1467	1437	1417	1390	1369	1342	1298
unrelevant	-------	-------	3127	2773	2436	2098	1802	1521	1316	1131	948

multires/threshold	/	/	0.65	0.66	0.67	0.68	0.69	0.70	0.71	0.72	0.73
same-origin	-------	-------	1514	1493	1460	1429	1396	1367	1335	1313	1272
unrelevant	-------	-------	3393	2951	2464	2016	1647	1361	1121	945	781

Actually we also try another way to reduce the dimensionality, that is, train the model with pair loss by calculating the distance of all the pairs within a batch, and it reaches similar effects as "triplet512" above, which is about a mAP of 78.5x on uni-oxford dataset.

Dataset && Code

The dataset used here includes Oxford dataset, Paris dataset, Landmark dataset and cover dataset.

The Python code is under the path 'myPython/' as follow:

├── myPython/
│   ├── custom_layers.py              /* custom Python layers
│   ├── model_tools.py                /* visualization and modification of Caffe weights
│   ├── region_generator.py           /* RoI generator given the resolution
│   ├── check_on_cover.py             /* test of precision and recall under different similarity threshold on cover dataset
│   ├── train.py                      /* train using Caffe PythonAPI
│   ├── dataset_helper/               /* operations on several dataset
│   │   ├── cover_helper.py           /* cover
│   │   ├── landmark_helper.py        /* Landmark
│   │   ├── paris_helper.py           /* Paris
│   │   ├── oxford_helper.py          /* Oxford
│   │   ├── download_helper.py        /* download images by cmsid
│   ├── test/                         /* evaluation
│   │   ├── test_on_oxford.py         /* Oxford
│   │   ├── test_on_paris.py          /* Paris
│   │   ├── test_on_landmark.py       /* Landmark
│   │   ├── test_on_cover.py          /* cover
│   │   ├── test_SimilarCluster.py    /* visualization of the similar clusters
│   ├── convert/                      /* convert the images into embeddings
│   ├── offline/                      /* offline PCA experiments

The configuration of models and their weights are under the path 'proto/' and 'caffemodel/' as follow：

├── proto/
│   ├── offline/                      /* offline PCA experiments
│   ├── distilling/                   /* knowledge distilling（including the 3-resolution teacher model）
│   ├── triplet/                      /* training with triplet-loss
│   ├── reduce/                       /* training with pair loss (parallel for the RoI features，serial for the embeddings)
│   ├── deploy_resnet101.prototxt     /* original model
│   ├── deploy_resnet101_normpython.prototxt     /* original model with Python normalize_layer

The weights of the original model and the distilled one could be downloaded from the Google Drive.

Deep Retrieval

This package contains the pretrained ResNet101 model and evaluation script for the method proposed in the following papers:

Deep Image Retrieval: Learning global representations for image search. A. Gordo, J. Almazan, J. Revaud, and D. Larlus. In ECCV, 2016
End-to-end Learning of Deep Visual Representations for Image Retrieval. A. Gordo, J. Almazan, J. Revaud, and D. Larlus. CoRR abs/1610.07940, 2016

Dependencies:

Caffe
Region of Interest pooling layer (ROIPooling). This is the same layer used by fast RCNN and faster RCNN. A C++ implementation can be found in BVLC/caffe#4163
L2-normalization layer (Normalize). Implemented in C++ in https://github.com/happynear/caffe-windows. As an alternative, we provide a python implementation of this layer that produces the same results, but is less efficient and does not implement backpropagation.

Datasets

The evaluation script is prepared to work on the Oxford 5k and Paris 6k datasets. To set up the datasets:

mkdir datasets
cd datasets

Evaluation script:

mkdir evaluation
cd evaluation
wget http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/compute_ap.cpp
sed -i '6i#include <cstdlib>' compute_ap.cpp # Add cstdlib, as some compilers will produce an error otherwise
g++ -o compute_ap compute_ap.cpp
cd ..

Oxford:

mkdir -p Oxford
cd Oxford
mkdir jpg lab
wget http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/oxbuild_images.tgz
tar -xzf oxbuild_images.tgz -C jpg
wget http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/gt_files_170407.tgz
tar -xzf gt_files_170407.tgz -C lab
cd ..

Paris

mkdir -p Paris
cd Paris
mkdir jpg lab tmp
# Images are in a different folder structure, need to move them around
wget http://www.robots.ox.ac.uk/~vgg/data/parisbuildings/paris_1.tgz
wget http://www.robots.ox.ac.uk/~vgg/data/parisbuildings/paris_2.tgz
tar -xzf paris_1.tgz -C tmp
tar -xzf paris_2.tgz -C tmp
find tmp -type f -exec mv {} jpg/ \;
rm -rf tmp
wget http://www.robots.ox.ac.uk/~vgg/data/parisbuildings/paris_120310.tgz
tar -xzf paris_120310.tgz -C lab
cd ..
cd ..

Usage

$ python test.py

usage: test.py [-h] --gpu GPU --S S --L L --proto PROTO --weights WEIGHTS
               --dataset DATASET --dataset_name DATASET_NAME --eval_binary
               EVAL_BINARY --temp_dir TEMP_DIR [--multires] [--aqe AQE]
               [--dbe DBE]

G: gpu id
S: size to resize the largest side of the images to. The model is trained with S=800, but different values may work better depending on the task.
L: number of levels of the rigid grid. Model was trained with L=2, but different levels (e.g. L=1 or L=3) may work better on other tasks.
PROTO: path to the prototxt. There are two prototxts included.
  deploy_resnet101.prototxt relies on caffe being compiled with the normalization layer.
  deploy_resnet101_normpython.prototxt does not have that requirement as it relies on the python implementation, but it may be slower as it is done on the cpu and does not implement backpropagation.
WEIGHTS: path to the caffemodel
DATASET: path to the dataset, for Oxford and Paris it is the directory that contains the jpg and lab folders.
DATASET_NAME: either Oxford or Paris
EVAL_BINARY: path to the compute_ap binary provided with Oxford and Paris used to compute the ap scores
TEMP_DIR: a temporary directory to store features and scores

Note that this model does not implement the region proposal network.

Examples

Adjust paths as necessary:

Rigid grid, no multiresolution, no query expansion or database side feature augmentation:

python test.py --gpu 0 --S 800 --L 2 --proto deploy_resnet101_normpython.prototxt --weights model.caffemodel --dataset datasets/Oxford --eval_binary datasets/evaluation/compute_ap --temp_dir tmp --dataset_name Oxford

Expected accuracy: 84.09

python test.py --gpu 0 --S 800 --L 2 --proto deploy_resnet101_normpython.prototxt --weights model.caffemodel --dataset datasets/Paris --eval_binary datasets/evaluation/compute_ap --temp_dir tmp --dataset_name Paris

Expected accuracy: 93.57

Rigid grid, multiresolution, no query expansion or database side feature augmentation:

python test.py --gpu 0 --S 800 --L 2 --proto deploy_resnet101_normpython.prototxt --weights model.caffemodel --dataset datasets/Oxford --eval_binary datasets/evaluation/compute_ap --temp_dir tmp --dataset_name Oxford --multires

Expected accuracy: 86.07

python test.py --gpu 0 --S 800 --L 2 --proto deploy_resnet101_normpython.prototxt --weights model.caffemodel --dataset datasets/Paris --eval_binary datasets/evaluation/compute_ap --temp_dir tmp --dataset_name Paris --multires

Expected accuracy: 94.53

Rigid grid, multiresolution, query expansion (k=1) and database side feature augmentation (k=20):

python test.py --gpu 0 --S 800 --L 2 --proto deploy_resnet101_normpython.prototxt --weights model.caffemodel --dataset datasets/Oxford --eval_binary datasets/evaluation/compute_ap --temp_dir tmp --dataset_name Oxford –multires --aqe 1 --dbe 20

Expected accuracy: 94.68

python test.py --gpu 0 --S 800 --L 2 --proto deploy_resnet101_normpython.prototxt --weights model.caffemodel --dataset datasets/Paris --eval_binary datasets/evaluation/compute_ap --temp_dir tmp --dataset_name Paris –multires --aqe 1 --dbe 20

Expected accuracy: 96.58

Citation

If you use these models in your research, please cite:

@inproceedings{Gordo2016a,
      title={Deep Image Retrieval: Learning global representations for image search},
      author={Albert Gordo and Jon Almazan and Jerome Revaud and Diane Larlus},
      booktitke={ECCV},
      year={2016}
}   
@article{Gordo2016b,
      title={End-to-end Learning of Deep Visual Representations for Image Retrieval}
      author={Albert Gordo and Jon Almazan and Jerome Revaud and Diane Larlus},
      journal={CoRR abs/1610.07940},
      year={2016}
}

Please see LICENSE.txt for the license information.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

eval

eval

images

images

myPython

myPython

proto

proto

.gitignore

.gitignore

LICENSE.txt

LICENSE.txt

README.md

README.md

test.py

test.py

test.sh

test.sh

Repository files navigation

my Deep Retrieval

Offline PCA

Knowledge Distiiling

Metric Learning

Dataset && Code

Deep Retrieval

Dependencies:

Datasets

Usage

Examples

Citation

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 143 Commits
eval		eval
images		images
myPython		myPython
proto		proto
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
test.py		test.py
test.sh		test.sh

License

wozhouh/my-deep-retrieval

Folders and files

Latest commit

History

Repository files navigation

my Deep Retrieval

Offline PCA

Knowledge Distiiling

Metric Learning

Dataset && Code

Deep Retrieval

Dependencies:

Datasets

Usage

Examples

Citation

About

Resources

License

Stars

Watchers

Forks

Languages