Arborist: Parallelized, Extensible Random Forests
The Arborist provides a fast, open-source implementation of Leo Breiman's Random Forest algorithm. The Arborist achieves its speed through efficient C++ code and parallel, distributed tree construction.
Bindings are available for R and Python (evolving).
R
The Arborist is available on CRAN as the Rborist package.
Installation of Release Version:
> install.packages('Rborist')
Installation of Development Version:
> ./Rborist/Package/Rborist.CRAN.sh
> R CMD INSTALL Rborist_*.*-*.tar.gz
Notes
- Rborist version 0.2-3 passes all checks on CRAN.
Python
- Version 0.1-0 has been archived.
- Version 0.2-4 is under active development.
- Test cases sought.
Performance
Performance metrics have been measured using benchm-ml. Partial results can be found here.
Some users have reported diminished performance when running single-threaded. We recommend running with at least two cores, as frequently executed inner loops have been written specifically to take advantage of multiple cores. In particular, when using a scaffold such as caret, allocate more cores to Rborist than to the scaffold itself.
This paper compares several implementations of the Random Forest algorithm, including Rborist: (https://www.jstatsoft.org/article/view/v077i01/v77i01.pdf). Benchmarks used in the study are also provided at https://www.jstatsoft.org/article/view/v077i01.
A recent paper compares several categories of regression tools, including Random Forests. Rborist is among the faster packages offering high prediction accuracy: (https://doi.org/10.1109/ACCESS.2019.2933261). Based on the findings, we are investigating changes to the package's default settings. In particular, fixed-number predictor sampling (mtry) appears to provide more accurate predictions at low dimension than the current approach of Bernoulli sampling.
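The difference between the two sampling schemes mentioned above can be illustrated with a minimal sketch. It is not the package's implementation: the function names `sample_fixed` and `sample_bernoulli` are hypothetical, and the choice of four predictors with an inclusion probability of 0.5 is an assumption made only for illustration. Fixed-number sampling always yields exactly `mtry` candidate predictors per split, whereas Bernoulli sampling includes each predictor independently, so the number of candidates varies from split to split and can occasionally be zero when there are few predictors.

```python
import random

def sample_fixed(n_pred, mtry, rng):
    """Fixed-number sampling: exactly `mtry` distinct predictor indices."""
    return sorted(rng.sample(range(n_pred), mtry))

def sample_bernoulli(n_pred, prob, rng):
    """Bernoulli sampling: each predictor is kept independently with probability `prob`."""
    return [j for j in range(n_pred) if rng.random() < prob]

rng = random.Random(42)   # seeded only to make the sketch repeatable
n_pred = 4                # assumed low-dimensional setting
n_draws = 1000

# Record how many candidate predictors each scheme offers per simulated split.
fixed_sizes = [len(sample_fixed(n_pred, 2, rng)) for _ in range(n_draws)]
bern_sizes = [len(sample_bernoulli(n_pred, 0.5, rng)) for _ in range(n_draws)]

# Fraction of Bernoulli draws that offered no candidate predictor at all.
empty_fraction = bern_sizes.count(0) / n_draws

print("fixed-number candidate counts:", sorted(set(fixed_sizes)))
print("Bernoulli candidate counts:   ", sorted(set(bern_sizes)))
print("fraction of empty Bernoulli draws:", empty_fraction)
```

Both schemes here have the same expected number of candidates (two), but only the Bernoulli counts spread around that mean. How the package treats an empty draw is not covered by this sketch.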
References
- Scalability Issues in Training Decision Trees (video), Nimbix Developer Summit, 2017.
- Controlling for Monotonicity in Random Forest Regressors (PDF), R in Finance, May 2016.
- Accelerating the Random Forest algorithm for commodity parallel hardware (video), PyData, July 2015.
- The Arborist: A High-Performance Random Forest (TM) Implementation, R in Finance, May 2015.
- Training Random Forests on the GPU: Tree Unrolling (PDF), GTC, March 2015.
News/Changes
- Prediction and validation support large observation counts (exceeding 32 bits).
- Trained forest index ranges may now exceed 32 bits. Index ranges for individual trees remain constrained to 32 bits, for now.
- New option 'keyed' identifies predictors by name, rather than position within frame.
- New option 'impPermute' introduces permutation-based variable importance.
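Permutation-based variable importance, which the 'impPermute' option introduces, can be sketched in generic form as follows. This shows the general technique only, not the package's own procedure: the helper names `mse` and `permutation_importance`, the toy data, and the stand-in `predict` function (used in place of a trained forest) are all assumptions. The idea is to shuffle one predictor's column, re-run prediction, and report how much the error grows relative to the unpermuted baseline.

```python
import random

def mse(predict, rows, targets):
    """Mean squared error of `predict` over the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(predict, rows, targets, col, rng):
    """Increase in error after shuffling column `col` across the rows."""
    baseline = mse(predict, rows, targets)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    # Rebuild each row with only the chosen column replaced by a shuffled value.
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
    return mse(predict, permuted, targets) - baseline

rng = random.Random(0)  # seeded only to make the sketch repeatable

# Toy data: the response depends on predictor 0 and ignores predictor 1.
rows = [[rng.random(), rng.random()] for _ in range(200)]
targets = [3.0 * r[0] for r in rows]

def predict(row):
    """Stand-in for a trained model; assumed to have learned the true relation."""
    return 3.0 * row[0]

imp0 = permutation_importance(predict, rows, targets, 0, rng)
imp1 = permutation_importance(predict, rows, targets, 1, rng)
print("importance of predictor 0:", imp0)
print("importance of predictor 1:", imp1)
```

In this toy setting, shuffling the predictor that drives the response degrades the predictions, while shuffling the ignored predictor leaves the error unchanged.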
Correctness and runtime errors are addressed as received. With reproducible test cases, repairs are typically uploaded to GitHub within several days.
Feature requests are addressed on a case-by-case basis.