An R package implementing a parallel Dirichlet Random Forest for modeling compositional data, built on a Dirichlet log-likelihood splitting criterion and OpenMP-accelerated tree construction.
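For reference, the Dirichlet log-likelihood that such a splitting criterion is built on can be written as below. The first line is the standard log-likelihood of n compositions with k parts. The second line is only a generic sketch of a likelihood-based split gain: the symbols (P for the parent node, L and R for its children, and the fitted parameters) are notation assumed here, and the package's exact criterion may differ.

```latex
\ell(\alpha;\, y_1,\dots,y_n)
  = \sum_{i=1}^{n}\left[
      \log\Gamma\!\Big(\sum_{j=1}^{k}\alpha_j\Big)
      - \sum_{j=1}^{k}\log\Gamma(\alpha_j)
      + \sum_{j=1}^{k}(\alpha_j - 1)\log y_{ij}
    \right]

\mathrm{gain}
  = \ell(\hat\alpha_L;\, y_L) + \ell(\hat\alpha_R;\, y_R) - \ell(\hat\alpha_P;\, y_P)
```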
devtools::install_github("Xaleed/DirichletRF")

The GitHub version includes several features not yet available on CRAN:
- Out-of-bag (OOB) predictions and error estimation
- Distributional predictions and proximity weights
- Feature importance (gain, split count, permutation)
install.packages("DirichletRF")
⚠️ The CRAN version does not include OOB estimation, distributional mode, or importance measures. Use the GitHub version for full functionality.
install.packages(
"https://github.com/Xaleed/DirichletRF/releases/download/v0.1.0/DirichletRF_0.1.0.zip",
repos = NULL, type = "win.binary"
)

library(DirichletRF)
set.seed(42)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)
# X1 and X2 are informative
alpha_mat <- cbind(2 + 3 * (X[,1] > 0), 3 + 3 * (X[,2] > 0), rep(4, n))
G <- matrix(rgamma(n * 3, shape = as.vector(t(alpha_mat))), n, 3, byrow = TRUE)
Y <- G / rowSums(G)
forest <- DirichletRF(X, Y, num.trees = 100)
print(forest)

Xtest <- matrix(rnorm(10 * p), 10, p)
colnames(Xtest) <- paste0("X", 1:p)
pred <- predict(forest, Xtest)
pred$mean_predictions # mean-based predictions
pred$alpha_predictions / rowSums(pred$alpha_predictions) # parameter-based

OOB estimation requires replace = TRUE or sample.fraction < 1, so that each tree leaves some observations out of its training sample. Not available in the CRAN version.
# Bootstrap OOB
forest_oob <- DirichletRF(X, Y, num.trees = 100,
replace = TRUE, compute.oob = TRUE)
forest_oob$oob$mse
forest_oob$oob$predictions # n x k matrix, NA where never OOB
# Subsampling OOB
forest_sub <- DirichletRF(X, Y, num.trees = 100,
replace = FALSE, sample.fraction = 0.632,
compute.oob = TRUE)
forest_sub$oob$mse

Feature importance is not available in the CRAN version.
# Impurity-based: gain and split count
importance(forest)
# Permutation-based (requires compute.oob = TRUE)
permutation_importance(forest_oob, X, loss = "aitchison",
num.permutations = 10, seed = 42L)

Both functions return a data frame sorted by importance in descending order. permutation_importance() supports "aitchison" (default), "mse", and "kl" as loss functions.
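To make the three loss names concrete, here is a minimal base-R sketch of what losses of this kind usually compute. The helper names (`clr`, `aitchison_loss`, `mse_loss`, `kl_loss`) are hypothetical and are not the package's internals. The scaling and averaging inside the package may differ, and all entries are assumed to be strictly positive.

```r
# Hypothetical helpers for illustration only; not part of DirichletRF.

# Centred log-ratio transform, applied row by row.
clr <- function(Y) {
  L <- log(Y)
  L - rowMeans(L)
}

# Mean squared Aitchison distance between observed and predicted compositions.
aitchison_loss <- function(Y, P) mean(rowSums((clr(Y) - clr(P))^2))

# Plain mean squared error on the simplex.
mse_loss <- function(Y, P) mean((Y - P)^2)

# Mean Kullback-Leibler divergence KL(Y || P), row by row.
kl_loss <- function(Y, P) mean(rowSums(Y * log(Y / P)))

Y <- rbind(c(0.2, 0.3, 0.5), c(0.6, 0.3, 0.1))     # observed compositions
P <- rbind(c(0.25, 0.25, 0.5), c(0.5, 0.3, 0.2))   # predicted compositions

aitchison_loss(Y, P)
mse_loss(Y, P)
kl_loss(Y, P)
```

The Aitchison variant works on log-ratios, so it is sensitive to relative errors in small parts, whereas MSE treats all parts on the raw scale.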
With distributional = TRUE, the forest stores leaf indices so that it can return a weighted distribution over training observations for any test point. Not available in the CRAN version.
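The idea behind leaf-based weights can be sketched in base R. The function below assumes the usual forest-weight construction: in each tree, the training points sharing the test point's leaf split a weight of one equally, and the result is averaged over trees. The inputs `leaf_train` and `leaf_test` and the helper `leaf_weights` are hypothetical, and the package may compute its weights differently, for example by using only in-bag observations.

```r
# Hypothetical sketch; not the package's implementation.
# leaf_train: n_train x n_trees matrix of leaf ids for the training points
# leaf_test:  vector of length n_trees with the test point's leaf id per tree
leaf_weights <- function(leaf_train, leaf_test) {
  n_trees <- ncol(leaf_train)
  w <- numeric(nrow(leaf_train))
  for (b in seq_len(n_trees)) {
    same <- leaf_train[, b] == leaf_test[b]   # co-members in tree b
    if (any(same)) w <- w + same / sum(same)  # share one unit of weight
  }
  w / n_trees                                 # average over trees
}

# Toy forest: 4 training points, 2 trees.
leaf_train <- cbind(c(1, 1, 2, 2), c(1, 2, 2, 2))
leaf_test  <- c(1, 2)
w <- leaf_weights(leaf_train, leaf_test)
w
sum(w)
```

Because every tree distributes exactly one unit of weight, the weights for a test point sum to 1 whenever its leaf contains at least one training point in each tree.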
forest_dist <- DirichletRF(X, Y, num.trees = 100, distributional = TRUE)
# n_test x n_train weight matrix; rows sum to 1
W <- predict_weights(forest_dist, Xtest)
# Weighted conditional mean
Y_hat <- W %*% Y
# Draw samples from the conditional distribution P(Y | x_new)
draws <- sample_conditional(forest_dist, x_new = Xtest[1, ], size = 200L)
colMeans(draws) # estimated conditional mean

OOB weights are available when both distributional = TRUE and compute.oob = TRUE.
forest_full <- DirichletRF(X, Y, num.trees = 100,
distributional = TRUE,
replace = TRUE, compute.oob = TRUE)
W_oob <- forest_full$oob$weights # n x n, generally asymmetric
W_sym <- (W_oob + t(W_oob)) / 2 # symmetrise if needed

| Parameter | Default | Description |
|---|---|---|
| num.trees | 100 | Number of trees |
| max.depth | 10 | Maximum tree depth |
| min.node.size | 5 | Minimum leaf size |
| mtry | -1 (= √p) | Candidate features per split |
| est.method | "mom" | "mom" or "mle" |
| num.cores | -1 (all cores minus one) | OpenMP threads |
| replace | FALSE | Sample with replacement (bootstrap) if TRUE; subsample without replacement if FALSE |
| sample.fraction | 1.0 | Fraction of data per tree |
| compute.oob | FALSE | Compute OOB predictions |
| distributional | FALSE | Store leaf indices for weight-based predictions |
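Reading "mom" as method of moments (an assumption, since the table does not expand the abbreviation), one common moment estimator for Dirichlet parameters can be sketched as follows. The helper `dirichlet_mom` is hypothetical. It averages the precision estimate over all parts, which is one of several variants and not necessarily the one the package uses.

```r
# Hypothetical helper; one common method-of-moments variant.
# Uses E[Y_j] = alpha_j / alpha_0 and
#      Var(Y_j) = m_j (1 - m_j) / (alpha_0 + 1), with alpha_0 = sum(alpha).
dirichlet_mom <- function(Y) {
  m <- colMeans(Y)                      # estimated mean composition
  v <- apply(Y, 2, var)                 # per-part sample variance
  alpha0 <- mean(m * (1 - m) / v - 1)   # precision, averaged over parts
  m * alpha0
}

# Check on simulated Dirichlet(2, 3, 4) data.
set.seed(1)
n <- 5000
alpha <- c(2, 3, 4)
G <- matrix(rgamma(n * 3, shape = rep(alpha, each = n)), n, 3)
Y_sim <- G / rowSums(G)
alpha_hat <- dirichlet_mom(Y_sim)
alpha_hat
```

A moment estimator of this kind needs no iteration, which is why it is a natural cheap default inside a tree-building loop, while maximum likelihood typically requires numerical optimisation.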
Note: X must be numeric. Use one-hot encoding for categorical covariates. Rows of Y must sum to 1.
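A minimal base-R sketch of preparing inputs that meet these requirements is shown below. The data frame `df`, its columns, and the `raw` matrix are made-up illustrations. Note that `model.matrix()` with the intercept removed fully expands only the first factor in the formula, so data with several factors may need separate handling.

```r
# Made-up covariates with one categorical column.
df <- data.frame(
  age  = c(31, 45, 27, 52),
  soil = factor(c("clay", "sand", "clay", "loam"))
)

# One-hot encode: dropping the intercept gives every level of `soil` a column.
X_enc <- model.matrix(~ . - 1, data = df)
colnames(X_enc)

# Made-up raw amounts; closing each row turns them into compositions.
raw <- rbind(c(12, 30, 58), c(5, 5, 10), c(1, 2, 1), c(7, 1, 2))
Y_comp <- raw / rowSums(raw)

# Sanity checks before fitting.
stopifnot(is.numeric(X_enc), all(abs(rowSums(Y_comp) - 1) < 1e-8))
```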
Masoumifard, K., van der Westhuizen, S., & Gardner-Lubbe, S. (2026). Dirichlet random forest for predicting compositional data. In A. Bekker et al. (Eds.), Environmental Modelling with Contemporary Statistics. Chapman & Hall/CRC. ISBN: 9781032903910.
GPL-3
