face: https://drive.google.com/open?id=16aNBnAGg6BrrTVxOXsjGnvlr89vJYX5q
fasttext: https://drive.google.com/open?id=1FuWBcNxT_wO5cSCTJXJ6a6JjfD_vhfSb
youtube: https://drive.google.com/open?id=1vn30JLdqlRf5qW62XussqpnnnwamXIwH
mkdir data
cd data
mkdir face face/real_data fasttext_cos fasttext_cos/real_data fasttext_eu fasttext_eu/real_data youtube youtube/real_data
Download the above datasets and put them into the corresponding "real_data" directories (a short copy sketch follows this list):
- face_originalData.npy -> ./data/face/real_data/face_originalData.npy
- fasttext_originalData.npy -> ./data/fasttext_cos/real_data/fasttext_cos_originalData.npy
- fasttext_originalData.npy -> ./data/fasttext_eu/real_data/fasttext_eu_originalData.npy (the same downloaded fasttext file is copied into both fasttext directories and renamed accordingly)
- youtube_originalData.npy -> ./data/youtube/real_data/youtube_originalData.npy
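If you prefer, the copies can be done with a short Python snippet. This is only a sketch and assumes the four downloaded .npy files sit in the repository root; adjust the source paths to wherever you saved them.

import shutil

# Copy each downloaded file into its real_data directory.
# The fasttext file is copied twice, once per distance function.
shutil.copy("face_originalData.npy", "data/face/real_data/face_originalData.npy")
shutil.copy("fasttext_originalData.npy", "data/fasttext_cos/real_data/fasttext_cos_originalData.npy")
shutil.copy("fasttext_originalData.npy", "data/fasttext_eu/real_data/fasttext_eu_originalData.npy")
shutil.copy("youtube_originalData.npy", "data/youtube/real_data/youtube_originalData.npy")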
When new datasets are used, three important hyper-parameters need to be set manually in the source code:
- max_tau: the maximum threshold value. SelNet only accepts thresholds within [0, max_tau], so the thresholds in the training and testing data must be transformed into the range [0, max_tau] (see the threshold-scaling sketch after this list). Otherwise SelNet automatically maps thresholds larger than max_tau into [0, max_tau], and performance degrades badly because SelNet gives first priority to the monotonic property.
- tau_part_num: the number of knot points (i.e., piecewise linear functions) used in [0, max_tau].
- epochs: the maximum number of training epochs. A small value is enough, as the model converges quickly.
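The following is a minimal sketch of the threshold transformation mentioned for max_tau. The simple linear rescaling and the file name "thresholds.npy" are only assumptions for illustration; the repository's own preprocessing may differ.

import numpy as np

max_tau = 100.0                              # hypothetical value; choose it for your dataset
raw_thresholds = np.load("thresholds.npy")   # hypothetical file holding the query thresholds
# Rescale thresholds linearly into [0, max_tau] so that none exceeds max_tau.
scaled_thresholds = raw_thresholds / raw_thresholds.max() * max_tau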
.
├── data
│   ├── face
│   ├── fasttext_cos
│   ├── fasttext_eu
│   └── youtube
├── model
│   ├── Dispatcher.py
│   ├── __pycache__
│   ├── selnet.py
│   └── selnetpart.py
├── proc
│   ├── covertree
│   ├── randompartition
│   ├── shell
│   └── train
└── run
    ├── CoverTree
    ├── RandomPartition
    └── one
- data: the datasets and all training information
- model: the SelNet model
- proc: scripts for generating training data and for the cover tree and random partition strategies
- run: scripts for training and inference
  - one: run SelNet without a partition strategy
  - CoverTree: run SelNet with the Cover Tree partition
  - RandomPartition: run SelNet with the Random partition
Here we use the face dataset as an example.
Note that the hyper-parameter "epochs" is currently set to its maximum value, i.e., 1500, in the source code, but there is no need to train for that many epochs.
cd ./proc/shell
./train.sh face
The script train.sh generates the training, validation, and testing data for SelNet without a partition strategy.
Computing the labels for the training, validation, and testing queries may take a long time.
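For intuition, the label of a query at a given threshold is the number of data points within that distance of the query. Below is a minimal sketch of this computation, assuming Euclidean distance and that the .npy file stores one vector per row; the actual scripts handle much larger data and the repository's own distance functions.

import numpy as np

data = np.load("./data/face/real_data/face_originalData.npy")  # (n, d) array, one vector per row
query = data[0]                                                # a sample query vector
tau = 1.0                                                      # a hypothetical threshold
# The selectivity label is the number of data points within distance tau of the query.
distances = np.linalg.norm(data - query, axis=1)
label = int((distances <= tau).sum())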
To train with a partition strategy, there are two options: cover tree (CT) partition or random partition (RP).
./train_ct.sh face
./train_rp.sh face
All generated data are stored in the directory ./data/face/train/.
Generating the training data for CT or RP also takes a long time.
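As a rough illustration of the partition idea (this is an assumption based on the per-partition training scripts, not a description taken from the code): the dataset is split into several partitions, training data are generated per partition, a SelNet is trained for each, and the per-partition estimates are combined at query time.

import numpy as np

data = np.load("./data/face/real_data/face_originalData.npy")
num_parts = 4                                        # hypothetical number of partitions
rng = np.random.default_rng(0)
# Random partition: assign each data point to one partition uniformly at random.
assignment = rng.integers(0, num_parts, size=len(data))
partitions = [data[assignment == p] for p in range(num_parts)]
# One SelNet would then be trained per partition, and the per-partition
# estimates combined when answering a query.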
For other datasets, simply replace "face" with "fasttext_cos", "fasttext_eu", or "youtube".
cd ./run/one/
./train_face_d128_2M_smallSel_huber_log.sh
python predict_face_d128_2M_smallSel_huber_log.py
cd ./run/CoverTree
./train_one_face_d128_2M_smallSel_huber_log.sh
python predict_one_face_d128_2M_smallSel_huber_log.py
cd ./run/RandomPartition
./train_one_face_d128_2M_smallSel_huber_log.sh
python predict_one_face_d128_2M_smallSel_huber_log.py
The other datasets can be run in the same way as face.