Learning ranking distributions using hyperplane arrangement

Written by Shizuo KAJI

See the following paper for details:

  • "A hyper-surface arrangement model of ranking distributions" by S. Kaji, A. Horiguchi, T. Abe (Kyushu University), and Y. Watanabe (ZOZO Technologies), ACM SIGKDD 2021.

Requirements

  • Python 3: Anaconda is recommended
  • Python libraries: Chainer, chainerui: pip install chainer chainerui
  • (optional) for parallel learning: mpi4py: conda install mpi4py

Terminology

  • A label (= alternative = item) is something that is ranked.
  • An instance (= agent = judge) is someone who ranks labels.
  • A ranking is a partial ordering of the labels by an instance.
  • A ranking distribution is a probability distribution over orderings of the labels, arising from a collection of rankings.
  • In our model, labels and instances are represented by points in the m-dimensional ball.
  • A ranking distribution can be represented by an n!-dimensional vector, where n is the number of labels.
  • Our model approximates the ranking distribution by the distribution of the volumes of the cells cut out by the hyperplane arrangement in the m-dimensional ball. The dimension of our model is nm.
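The cell-volume idea above can be illustrated with a small Monte Carlo sketch. This is not the repository's implementation; it assumes (as one natural reading of the model) that an instance ranks labels by distance to its point, so the cells are cut out by the perpendicular bisectors of label pairs, and a cell's volume approximates the probability of the corresponding ranking.

```python
import numpy as np

def estimate_ranking_distribution(labels, n_samples=100_000, dim=2, seed=0):
    """Monte Carlo estimate of cell volumes: sample instance points
    uniformly in the unit ball and rank the labels by distance."""
    rng = np.random.default_rng(seed)
    # uniform points in the dim-ball via rejection sampling from the cube
    pts = rng.uniform(-1, 1, size=(2 * n_samples, dim))
    pts = pts[np.linalg.norm(pts, axis=1) <= 1][:n_samples]
    # distance from every sampled instance to every label
    dist = np.linalg.norm(pts[:, None, :] - labels[None, :, :], axis=2)
    orders = np.argsort(dist, axis=1)  # nearest label first
    counts = {}
    for row in map(tuple, orders):
        counts[row] = counts.get(row, 0) + 1
    total = len(pts)
    return {perm: c / total for perm, c in counts.items()}

# two labels placed symmetrically: the bisector x = 0 splits the ball
# in half, so each of the two rankings should get probability ~0.5
labels = np.array([[-0.5, 0.0], [0.5, 0.0]])
probs = estimate_ranking_distribution(labels)
```

With n labels this distribution lives on (at most) n! cells, which is exactly the n!-dimensional vector mentioned above, while the model itself only stores the nm label coordinates.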

How to use

  • Partial ranking data should be prepared in a CSV file in which each line has the following format:
    id, x, y, z, ...

where x > y > z > ... for the instance with the given id. Multiple lines may share the same id, so any partial ordering can be specified.
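Each line of this format implies a set of pairwise comparisons: every label beats every label listed after it on the same line. The following reader is a hypothetical helper (not part of the repository) showing how the CSV decodes into such pairs.

```python
import csv
from collections import defaultdict

def load_pairwise(path):
    """Read a partial-ranking CSV (id, x, y, z, ... with x > y > z > ...)
    and return, per instance id, the implied (winner, loser) pairs."""
    pairs = defaultdict(set)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue
            iid = row[0].strip()
            ranking = [c.strip() for c in row[1:] if c.strip()]
            # each label beats every label listed after it on the same line
            for i, winner in enumerate(ranking):
                for loser in ranking[i + 1:]:
                    pairs[iid].add((winner, loser))
    return pairs
```

For example, the two lines "1, a, b, c" and "1, c, d" together give instance 1 the pairs (a,b), (a,c), (b,c), and (c,d), i.e. a partial order in which d is ranked below c but is incomparable to a and b.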

  • Learning the model: The following learns the ranking distribution from ranking.csv and outputs labels.csv (the learned label coordinates) and instances.csv (the learned instance coordinates) under the directory named result.
    python rankLearn.py ranking.csv -e 50 -lr 0.01 -lri 1e-3 -o result

The batch size (-b 20), the number of epochs (-e 50), and the learning rate (-lr 0.01) have a large impact on the speed and the accuracy of learning. Note that only labels.csv is needed for sampling from the learned model.

  • Learning can be resumed by giving initial configurations of label and/or instance coordinates.
    python rankLearn.py ranking.csv -e 50 -lr 0.01 -lri 1e-3 -o result -i instances.csv -l labels.csv
  • Parallel learning: if mpi4py is installed, learning can be parallelised using MPI.
    mpiexec -n 4 python rankLearn.py ranking.csv -e 50 -lr 0.01 -o result --mpi
  • To see the list of command-line options,
    python rankLearn.py -h

Example

  • For the SUSHI dataset, a sample embedding setting is (--batchsize 50 --epoch 100 --dim 9 --margin 0.01 --learning_rate 0.01 --lambda_ord 1 --lambda_repel_instance 0.0001 --lambda_ball 1)

  • For the toy dataset with two items (2label90-10.csv), a sample setting is (-b 100 -e 200 -d 2 -m 0.03 -lr 0.1 -lo 10 -lri 1 -lrl 1e-9 -lb 10 -se)

Visualisation of the first two principal components is shown in the repository's figures for the 2-item and the 10-item (sushi) datasets.

Evaluating the model

  • To compare the learned model with the data in terms of various metrics,
    python arrangement.py --label labels.csv -r2 ranking.csv --compute_wasserstein --top_n 3
  • To plot the coordinates of labels and instances,
    python arrangement.py --label labels.csv --instance instances.csv --plot

Sample ranking data creation

  • To sample 1000 rankings from the learned model labels.csv,
    python arrangement.py --label labels.csv -ni 1000 -g --dim 2 -o result

The sampled rankings (ranking.csv) and instance coordinates (instances.csv) are found under the directory named result.

  • To create a 2D sample arrangement with 5 labels and 1000 instances,
    python arrangement.py -nl 5 -ni 1000 -g --dim 2
  • To see the list of command-line options,
    python arrangement.py -h

TODO

  • Use the Lehmer code for indexing permutations.
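For reference, the Lehmer code is a standard bijection between permutations of n elements and the integers 0..n!-1, so permutations can be indexed in their lexicographic order without enumerating all n! of them. A sketch of the standard technique (not code from this repository):

```python
from math import factorial

def lehmer_index(perm):
    """Lexicographic rank of a permutation of 0..n-1 via its Lehmer code:
    the i-th digit counts later entries smaller than perm[i]."""
    n = len(perm)
    idx = 0
    for i, p in enumerate(perm):
        smaller_later = sum(1 for q in perm[i + 1:] if q < p)
        idx += smaller_later * factorial(n - 1 - i)
    return idx

def index_to_perm(idx, n):
    """Inverse: recover the permutation of 0..n-1 from its rank."""
    items = list(range(n))
    perm = []
    for i in range(n - 1, -1, -1):
        digit, idx = divmod(idx, factorial(i))
        perm.append(items.pop(digit))
    return perm
```

For example, lehmer_index([0, 1, 2, 3]) is 0 and lehmer_index([3, 2, 1, 0]) is 23, the first and last of the 4! = 24 permutations; ranking is O(n²) per permutation, versus O(n!) for a lookup table.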
