Unsupervised Learning using Videos (ICCV 2015)



This code is developed based on Caffe (see the Caffe project site).

This code is the implementation for training the siamese-triplet network in the paper:

Xiaolong Wang and Abhinav Gupta. Unsupervised Learning of Visual Representations using Videos. Proc. of IEEE International Conference on Computer Vision (ICCV), 2015. pdf


Training scripts are in rank_scripts/rank_alexnet:

Since the branches of the siamese network share weights, the prototxt defines only one network.

The input to the network is pairs of image patches; each pair consists of similar patches taken from the same video track. A label specifies which video each pair comes from: pairs from different videos get different labels (the actual value does not matter, as long as it is an integer). This way, the third (negative) patch can be taken from a pair with a different label.

In the loss, for each pair of patches, the layer tries to find a third, negative patch within the same batch. There are two ways to select it: random selection and hard negative mining.
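The selection step can be sketched in NumPy as follows. This is an illustrative sketch, not the repository's C++ implementation; the function name `select_negatives` is hypothetical, and the hard/random mix mirrors the `hard_ratio` / `rand_ratio` parameters explained below.

```python
import numpy as np

def select_negatives(feats, labels, anchor_idx, neg_num=4,
                     hard_ratio=0.5, rng=None):
    """Pick negative patches for one anchor from the same batch.

    feats:  (N, D) patch features (e.g. L2-normalised network outputs)
    labels: (N,) integer video ids; a different id marks a negative candidate
    Returns indices of neg_num negatives: a mix of "hard" negatives
    (closest to the anchor in feature space) and random ones.
    """
    if rng is None:
        rng = np.random.default_rng()
    # candidates: patches from a different video than the anchor
    cand = np.where(labels != labels[anchor_idx])[0]
    n_hard = int(neg_num * hard_ratio)
    n_rand = neg_num - n_hard
    # hard negatives: the candidates nearest to the anchor
    dists = np.linalg.norm(feats[cand] - feats[anchor_idx], axis=1)
    hard = cand[np.argsort(dists)[:n_hard]]
    # random negatives: drawn from the remaining candidates
    rest = np.setdiff1d(cand, hard)
    rand = rng.choice(rest, size=n_rand, replace=False)
    return np.concatenate([hard, rand])
```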

In the prototxt:

```
layer {
  name: "loss"
  type: "RankHardLoss"
  neg_num: 4
  pair_size: 2
  hard_ratio: 0.5
  rand_ratio: 0.5
  margin: 1
  bottom: "norml2"
  bottom: "label"
}
```

neg_num is the number of negative patches drawn for each pair of patches; with neg_num: 4, each pair yields 4 triplets. pair_size: 2 means the inputs are pairs of patches. hard_ratio: 0.5 means half of the negative patches are hard examples, and rand_ratio: 0.5 means the other half are selected at random. To start, you can simply set rand_ratio: 1 and hard_ratio: 0. The loss margin needs to be tuned for different tasks; trying margin: 0.5 or 0.1 may make a difference on other tasks.
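For intuition, the per-triplet cost can be sketched as a standard margin-based ranking hinge. This is a minimal sketch assuming squared Euclidean distance on the L2-normalised features; the actual layer is implemented in C++ under src, and may differ in the distance used.

```python
import numpy as np

def triplet_rank_loss(anchor, positive, negatives, margin=1.0):
    """Average hinge loss over triplets: max(0, margin + d(a,p) - d(a,n)).

    anchor, positive: feature vectors for one pair of similar patches
    negatives: list of feature vectors for the selected negative patches
    The loss is zero once every negative is farther from the anchor
    than the positive by at least `margin`.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    losses = [max(0.0, margin + d_pos - np.sum((anchor - n) ** 2))
              for n in negatives]
    return sum(losses) / len(losses)
```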


We offer two models trained with our method:

- color model: trained with RGB images.
- gray model: trained with grayscale images (3-channel inputs).
- prototxt: the prototxt for both models.
- mean: the mean file.

In case our server is down, the models can also be downloaded from dropbox:

- color model: trained with RGB images.
- gray model: trained with grayscale images (3-channel inputs).

Training Patches

The unsupervised mined patches can be downloaded from here: https://www.dropbox.com/sh/vgp2k3mdi61sdgr/AAB9vwX140jppHjp33n4UoO7a?dl=0

Each tar file contains a different set of patches. Note that YouTube.tar.gz can be extracted with "tar xf" even though it is named as a "tar.gz" file.
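This works because "tar xf" auto-detects the archive format rather than trusting the file extension. A quick self-contained demonstration (the demo/ paths below are made up for illustration):

```shell
# Create a plain (uncompressed) tar archive with a misleading .gz name.
mkdir -p demo && echo "patch" > demo/patch.txt
tar cf demo.tar.gz demo/patch.txt
rm -rf demo
# "tar xf" still extracts it correctly despite the name.
tar xf demo.tar.gz
cat demo/patch.txt
```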

The example of the training list can be downloaded from here: https://www.dropbox.com/s/tnbu2myy7g0i6l6/trainlist.txt?dl=0