Arborist: Parallelized, Extensible Random Forests
The Arborist provides a fast, open-source implementation of Leo Breiman's Random Forest algorithm. The Arborist achieves its speed through efficient C++ code and parallel, distributed tree construction.
Bindings are available for R and Python (evolving).
R
The Arborist is available on CRAN as the Rborist package.
Installation of Release Version:
> install.packages('Rborist')
Installation of Development Version:
> ./Rborist/Package/Rborist.CRAN.sh
> R CMD INSTALL Rborist_*.*-*.tar.gz
Notes
- Rborist version 0.2-3 passes all checks on CRAN.
Python
- Version 0.1-0 has been archived.
- Version 0.2-4 is under active development.
- Test cases sought.
Performance
Performance metrics have been measured using benchm-ml; partial results are available.
The following paper compares several implementations of the Random Forest algorithm, including Rborist: https://www.jstatsoft.org/article/view/v077i01/v77i01.pdf. Benchmarks used in the study are also provided at https://www.jstatsoft.org/article/view/v077i01.
A recent paper compares several categories of regression tools, including Random Forests, and finds Rborist to be among the faster packages offering high prediction accuracy (https://doi.org/10.1109/ACCESS.2019.2933261). Based on these findings, we are investigating changes to the package's default settings. In particular, sampling a fixed number of predictors (mtry) appears to yield more accurate predictions on low-dimensional data than the current approach of Bernoulli sampling.
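The difference between the two predictor-sampling schemes can be sketched in a few lines of plain Python. This is an illustration only, not Rborist code: the function names `sample_fixed` and `sample_bernoulli`, the predictor count of 4, and the probability of 0.5 are all assumptions chosen for the example. It shows that fixed-number sampling always offers the same number of split candidates, while Bernoulli sampling offers a variable number, which at low dimension can occasionally be zero. It does not attempt to reproduce the paper's accuracy findings.

```python
import random

def sample_fixed(n_pred, mtry, rng):
    """Fixed-number sampling: draw exactly `mtry` distinct predictor indices."""
    return sorted(rng.sample(range(n_pred), mtry))

def sample_bernoulli(n_pred, prob, rng):
    """Bernoulli sampling: keep each predictor independently with probability `prob`."""
    return [j for j in range(n_pred) if rng.random() < prob]

rng = random.Random(0)   # seeded for repeatability
n_pred = 4               # hypothetical low-dimensional problem

# Candidate-set sizes over many simulated splits.
fixed_sizes = [len(sample_fixed(n_pred, 2, rng)) for _ in range(1000)]
bern_sizes = [len(sample_bernoulli(n_pred, 0.5, rng)) for _ in range(1000)]

# Share of simulated splits in which Bernoulli sampling selected no predictor.
empty_share = bern_sizes.count(0) / len(bern_sizes)

print("fixed sizes seen:    ", sorted(set(fixed_sizes)))
print("bernoulli sizes seen:", sorted(set(bern_sizes)))
print("empty candidate sets:", empty_share)
```

With 4 predictors and a probability of 0.5, the chance that a Bernoulli draw selects nothing is 0.5^4, or about 6%. How an implementation handles such a draw is a design decision that this sketch leaves open.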
References
- Scalability Issues in Training Decision Trees (video), Nimbix Developer Summit, 2017.
- Controlling for Monotonicity in Random Forest Regressors (PDF), R in Finance, May 2016.
- Accelerating the Random Forest algorithm for commodity parallel hardware (video), PyData, July 2015.
- The Arborist: A High-Performance Random Forest (TM) Implementation, R in Finance, May 2015.
- Training Random Forests on the GPU: Tree Unrolling (PDF), GTC, March 2015.
News/Changes
- Version 0.2-4 will support prediction and validation for large observation counts (those exceeding 32 bits).
- New option 'impPermute' introduces permutation-based variable importance.
- New option 'nThread' enables specification of OpenMP thread count.
- New option 'oob' constrains prediction to the out-of-bag set, essential for variable importance testing.
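The idea behind permutation-based variable importance can be sketched as follows. This is a minimal, self-contained illustration in plain Python and does not use or describe Rborist's API: `toy_model` is a hypothetical stand-in for a trained forest, and the held-out rows stand in for the out-of-bag set. Importance is measured as the increase in mean squared error after one predictor column is shuffled. Scoring on rows the model did not train on is what keeps the measure from rewarding memorization.

```python
import random

def mse(predict, rows, y):
    """Mean squared error of `predict` over the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

def permutation_importance(predict, rows, y, col, rng):
    """Increase in MSE after shuffling column `col` across the rows."""
    baseline = mse(predict, rows, y)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    # Rebuild each row with the shuffled value in place of the original.
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
    return mse(predict, permuted, y) - baseline

rng = random.Random(1)

# Hypothetical held-out data: column 0 drives the response, column 1 is noise.
rows = [[rng.random(), rng.random()] for _ in range(200)]
y = [3.0 * r[0] for r in rows]

def toy_model(r):
    """Stand-in for a trained forest's prediction function (assumption)."""
    return 3.0 * r[0]

imp_informative = permutation_importance(toy_model, rows, y, 0, rng)
imp_noise = permutation_importance(toy_model, rows, y, 1, rng)

print("importance of column 0:", imp_informative)
print("importance of column 1:", imp_noise)
```

Shuffling the informative column degrades the predictions, while shuffling the ignored column leaves the error unchanged.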
Correctness and runtime errors are addressed as received. With reproducible test cases, repairs are typically uploaded to GitHub within several days.
Feature requests are addressed on a case-by-case basis.