A Rust implementation of CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning (Siblini et al., 2018).
This implementation has been tested on datasets from the Extreme Classification Repository. Each data set comes either with a single data file and separate files for train / test splits, or with two separate train / test data files.
A data file starts with a header line with three space-separated integers: total number of examples, number of features, and number of labels. Following the header line, there is one line per each example, starting with comma-separated labels, followed by space-separated feature:value pairs:
label1,label2,...labelk ft1:ft1_val ft2:ft2_val ft3:ft3_val .. ftd:ftd_val
A split file is a integer matrix, with one line per row, and columns separated by spaces. The integers are example indices (1-indexed) in the corresponding data file, and each column corresponds to a separate split.
Precisions at 1, 3, and 5 are calculated for models trained with default hyper-parameters, e.g.
craftml train Mediamill/data.txt --cv_splits_path Mediamill/train_split.txtfor Mediamill, which has a single data file and separate train / test split files;
craftml train EURLex-4K/train.txt --test_data EURLex-4K/test.txtfor EURLex-4K, which has separate train / test data files.
These numbers are generally consistent with those reported in the original paper.
Note that if there isn't enough memory to train on a large data set, the
--test_trees_singly flag can be set to only train & test one tree at a time, and discard each tree when it's been tested. This allows one to obtain test results without being able to fit the entire model in memory. One can also tune the
--centroid_preserve_ratio option to trade off between model size and accuracy.
The project can be easily built with Cargo:
$ cargo build --release
The compiled binary file will be available at
$ craftml train --help craftml-train Train a new CRAFTML model USAGE: craftml train [FLAGS] [OPTIONS] <training_data> FLAGS: -h, --help Prints help information --test_trees_singly Test forest tree by tree, freeing each before training the next to reduce memory usage. Model cannot be saved. -V, --version Prints version information OPTIONS: --centroid_min_n_preserve <centroid_min_n_preserve> The minimum number of entries to preserve from puning, regardless preserve ratio setting. [default: 10] --centroid_preserve_ratio <centroid_preserve_ratio> A real number between 0 and 1, which is the ratio of entries with largest absoulte values to preserve. The rest of the entries are pruned. [default: 0.1] --cluster_sample_size <cluster_sample_size> Number of examples drawn for clustering on a branching node [default: 20000] --cv_splits_path <PATH> Path to the k-fold cross validation splits file, with k space-separated columns of indices (starting from 1) for training splits. --k_clusters <k_clusters> Number of clusters on a branching node [default: 10] --leaf_max_size <leaf_max_size> Maximum number of distinct examples on a leaf node [default: 10] --model_path <PATH> Path to which the trained model will be saved if provided --n_cluster_iters <n_cluster_iters> Number of clustering iterations to run on each branching node [default: 2] --n_feature_buckets <n_feature_buckets> Number of buckets into which features are hashed [default: 10000] --n_label_buckets <n_label_buckets> Number of buckets into which labels are hashed [default: 10000] --n_threads <n_threads> Number of worker threads. If 0, the number is selected automatically. [default: 0] --n_trees <n_trees> Number of trees in the random forest [default: 50] --out_path <PATH> Path to the which predictions will be written, if provided --test_data <PATH> Path to test dataset file used to calculate metrics if provided (in the format of the Extreme Classification Repository) ARGS: <training_data> Path to training dataset file (in the format of the Extreme Classification Repository)
$ craftml test --help craftml-test Test an existing CRAFTML model USAGE: craftml test [OPTIONS] <model_path> <test_data> FLAGS: -h, --help Prints help information -V, --version Prints version information OPTIONS: --k_top <k_top> Number of top predictions to write out for each test example [default: 5] --n_threads <n_threads> Number of worker threads. If 0, the number is selected automatically. [default: 0] --out_path <PATH> Path to the which predictions will be written, if provided ARGS: <model_path> Path to the trained model <test_data> Path to test dataset file (in the format of the Extreme Classification Repository)
- Siblini, W., Kuntz, P., & Meyer, F. (2018). CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning. In Proceedings of the 35th International Conference on Machine Learning (Vol. 80, pp. 4664–4673). Stockholmsmässan, Stockholm Sweden: PMLR. http://proceedings.mlr.press/v80/siblini18a.html