/
readme.txt
123 lines (83 loc) · 6.41 KB
/
readme.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
DiSMEC - Distributed Sparse Machines for Extreme multi-label classification problems
=====================
Some features of DiSMEC
=====================
- C++ code built over the Liblinear-64 [1] bit code-base to handle multi-label datasets in LibSVM format, and handle parallel training and prediction using openMP
- Handles extreme multi-label classification problems consisting of millions of labels in datasets which can be downloaded from Extreme Classification Repository
- Takes only a few minutes on EURLex-4K (eurlex) dataset consisting of about 4,000 labels and a few hours on WikiLSHTC-325K datasets consisting of about 325,000 labels
- Learns models in the batch of 1000 (by default, can be changed to suit your settings) labels
- Allows two-folds parallel training (a) A single batch of labels (say 1000 labels) is learnt in parallel using openMP while exploiting multiple cores on a single machine/node/computer (b) Can be invoked on multiple computers/machines/nodes simultaneously such that the successive launch starts from the next batch of 1000 labels(see -i option below)
- Tested on 64-bit Ubuntu only, not not very clean code but runs
Detailed instructions for using DiSMEC are given below. These instructions are to reproduce the results for Eurlex data only.
For more details, check our paper "DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification" [2] , or email me if anything is unclear
===================
CONTENTS
===================
There are directories
1) ./dismec contains the DiSMEC code
2) ./eurlex consists of data for EURLex-4k downloaded from XMC repository
3) ./prepostprocessing consists of Java code for (a) pre-processing data to get into tf-idf format and remapping labels and features, and (b) Evaluation of precision@k and nDCG@k corresponding to the prediction results.
===================
USAGE
===================
Data Pre-processing (in Java)
0) Download the eurlex dataset from XMC repository, and remove the first line from the train and test files downloaded, call them train.txt and test.txt
1) Change feature ID's so that they start from 1..to..number_of_features, using the code provided in FeatureRemapper.java using the following command
javac FeatureRemapper.java
java FeatureRemapper ../eurlex/train.txt ../eurlex/train-remapped.txt ../eurlex/test.txt ../eurlex/test-remapped.txt
2) Convert to tf-idf format using the code in file TfIdfCalculator.java
javac TfIdfCalculator.java
java TfIdfCalculator ../eurlex/train-remapped.txt ../eurlex/train-remapped-tfidf.txt ../eurlex/test-remapped.txt ../eurlex/test-remapped-tfidf.txt
3) Change labels ID's so that they also start from 1..to..number_of_labels, using the code provided in LabelRelabeler.java
javac LabelRelabeler.java
java LabelRelabeler ../eurlex/train-remapped-tfidf.txt ../eurlex/train-remapped-tfidf-relabeled.txt ../eurlex/test-remapped-tfidf.txt ../eurlex/test-remapped-tfidf-relabeled.txt ../eurlex/label-mappings.txt
===================
Building DiSMEC
Just run make command in the ../dismec/ directory. This will build the train and predict executable
===================
Training model with DiSMEC (in C++)
Training models with DiSMEC with 2-level parallelism
// make the directory to write the model files
mkdir ../eurlex/models
../dismec/train -s 2 -B 1 -i 1 ../eurlex/train-remapped-tfidf-relabeled.txt ../eurlex/models/1.model
../dismec/train -s 2 -B 1 -i 2 ../eurlex/train-remapped-tfidf-relabeled.txt ../eurlex/models/2.model
../dismec/train -s 2 -B 1 -i 3 ../eurlex/train-remapped-tfidf-relabeled.txt ../eurlex/models/3.model
../dismec/train -s 2 -B 1 -i 4 ../eurlex/train-remapped-tfidf-relabeled.txt ../eurlex/models/4.model
If run in parallel on multiple machines, the models might take around 5 mins to build on this dataset.
===================
Predicting with DiSMEC in parallel (in C++)
Since the base Liblinear code does not understand the comma separated labels. We need to zero out labels in the test file, and put that in a separate file (called GS.txt) consisting of only the labels.
javac LabelExtractor.java
java LabelExtractor ../eurlex/test-remapped-tfidf-relabeled.txt ../eurlex/test-remapped-tfidf-relabeled-zeroed.txt ../eurlex/GS.txt
mkdir ../eurlex/output // make the directory to write output files
../dismec/predict ../eurlex/test-remapped-tfidf-relabeled-zeroed.txt ../eurlex/models/1.model ../eurlex/output/1.out
../dismec/predict ../eurlex/test-remapped-tfidf-relabeled-zeroed.txt ../eurlex/models/2.model ../eurlex/output/2.out
../dismec/predict ../eurlex/test-remapped-tfidf-relabeled-zeroed.txt ../eurlex/models/3.model ../eurlex/output/3.out
../dismec/predict ../eurlex/test-remapped-tfidf-relabeled-zeroed.txt ../eurlex/models/4.model ../eurlex/output/4.out
===================
Performance evaluation (in Java)
Computation of Precision@k and nDCG@k for k=1,3,5
Now, we need to get final top-1, top-3 and top-5 from the output of individual models. This is done by the following :
****** IMPORTANT : Change the number of test points in DistributedPredictor.java (at line number 138) based on number of test points in the datasets ******
mkdir ../eurlex/final-output
javac DistributedPredictor.java
java DistributedPredictor ../eurlex/output/ ../eurlex/final-output/top1.out ../eurlex/final-output/top3.out ../eurlex/final-output/top5.out
javac MultiLabelMetrics.java
java MultiLabelMetrics ../eurlex/GS.txt ../eurlex/final-output/top1.out ../eurlex/final-output/top3.out ../eurlex/final-output/top5.out
The expected output should be something like this:
precision at 1 is 82.51380489087562
precision at 3 is 69.48023490227014
precision at 5 is 57.94372863528793
ndcg at 1 is 82.51380489087562
ndcg at 3 is 72.89528751985883
ndcg at 5 is 67.05303357275555
===================
Other Datasets from XMC repository:
If you would like to build for another dataset, please change the number of labels and replace with appropriate number at line number 2301, and then run make
long long nr_class = 3786; //3786 for eurlex
If you would like to change the batch size (1000 by default) for you settings, please replace with appropriate number at line number 2427, and then run make
int batchSize = 1000;
======================
References
[1] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification Journal of Machine Learning Research 9(2008), 1871-1874.
[2] R. Babbar, B. Schölkopf. DiSMEC: Distributed Sparse Machines for Extreme Multi-label Classification, WSDM 2017