DirichletRF

An R package implementing a parallel Dirichlet Random Forest for modeling compositional data, built on a Dirichlet log-likelihood splitting criterion and OpenMP-accelerated tree construction.
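To make the idea of a Dirichlet log-likelihood splitting criterion concrete, here is a minimal base-R sketch. It is not the package's internal code: the helper names (`rdirichlet_rows`, `dirichlet_mom`, `dirichlet_loglik`, `split_gain`) are hypothetical, and the method-of-moments estimator shown is just one common variant. The gain of a candidate split is taken to be the log-likelihood of the two child nodes minus that of the parent.

```r
set.seed(1)

# Simulate Dirichlet rows via normalized gamma draws (one alpha row per observation)
rdirichlet_rows <- function(alpha_mat) {
  G <- matrix(rgamma(length(alpha_mat), shape = as.vector(alpha_mat)),
              nrow = nrow(alpha_mat))
  G / rowSums(G)
}

# Method-of-moments estimate of alpha (assumed variant: precision averaged over parts)
dirichlet_mom <- function(Y) {
  m <- colMeans(Y)
  v <- apply(Y, 2, var)
  s <- mean(m * (1 - m) / v - 1)
  m * s
}

# Dirichlet log-likelihood of the rows of Y under a single parameter vector alpha
dirichlet_loglik <- function(Y, alpha) {
  nrow(Y) * (lgamma(sum(alpha)) - sum(lgamma(alpha))) +
    sum(log(Y) %*% (alpha - 1))
}

# Gain of a split: children log-likelihood minus parent log-likelihood
split_gain <- function(Y, left) {
  ll <- function(Z) dirichlet_loglik(Z, dirichlet_mom(Z))
  ll(Y[left, , drop = FALSE]) + ll(Y[!left, , drop = FALSE]) - ll(Y)
}

n <- 400
x <- rnorm(n)
alpha_mat <- cbind(2 + 6 * (x > 0), rep(3, n), rep(4, n))
Y <- rdirichlet_rows(alpha_mat)

gain_informative <- split_gain(Y, left = x > 0)          # split on the true driver
gain_noise       <- split_gain(Y, left = rnorm(n) > 0)   # split on pure noise
```

In this toy setting the split on `x > 0` should score far higher than the split on noise, which is the behaviour a tree builder relies on when choosing among candidate splits.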


📦 Installation

⭐ Recommended: Install from GitHub (latest version)

# install.packages("devtools")   # run first if devtools is not installed
devtools::install_github("Xaleed/DirichletRF")

The GitHub version includes several features not yet available on CRAN:

  • Out-of-bag (OOB) predictions and error estimation
  • Distributional predictions and proximity weights
  • Feature importance (gain, split count, permutation)

Install from CRAN (limited features)

install.packages("DirichletRF")

⚠️ The CRAN version does not include OOB estimation, distributional mode, or importance measures. Use the GitHub version for full functionality.

Windows binary (no Rtools required)

install.packages(
  "https://github.com/Xaleed/DirichletRF/releases/download/v0.1.0/DirichletRF_0.1.0.zip",
  repos = NULL, type = "win.binary"
)

🚀 Quick Start

library(DirichletRF)

set.seed(42)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)

# X1 and X2 are informative
alpha_mat <- cbind(2 + 3 * (X[,1] > 0), 3 + 3 * (X[,2] > 0), rep(4, n))
G <- matrix(rgamma(n * 3, shape = as.vector(t(alpha_mat))), n, 3, byrow = TRUE)
Y <- G / rowSums(G)

forest <- DirichletRF(X, Y, num.trees = 100)
print(forest)
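Before fitting, it can help to confirm that the simulated responses are valid compositions and that the informative covariates actually move them. The check below repeats the Quick Start simulation in base R only (no package call) and compares the average first component on either side of `X1 > 0`.

```r
set.seed(42)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)

# Same data-generating process as in the Quick Start
alpha_mat <- cbind(2 + 3 * (X[, 1] > 0), 3 + 3 * (X[, 2] > 0), rep(4, n))
G <- matrix(rgamma(n * 3, shape = as.vector(t(alpha_mat))), n, 3, byrow = TRUE)
Y <- G / rowSums(G)

# Every row should be a composition: strictly positive parts summing to 1
rows_ok <- all(Y > 0) && all(abs(rowSums(Y) - 1) < 1e-12)

# The first part should be larger on average when X1 > 0
mean_hi <- mean(Y[X[, 1] > 0, 1])
mean_lo <- mean(Y[X[, 1] <= 0, 1])
```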

🔧 Features & Examples

Prediction

Xtest <- matrix(rnorm(10 * p), 10, p)
colnames(Xtest) <- paste0("X", 1:p)

pred <- predict(forest, Xtest)
pred$mean_predictions                              # mean-based predictions
# Parameter-based: the mean of a Dirichlet(alpha) is alpha / sum(alpha)
pred$alpha_predictions / rowSums(pred$alpha_predictions)

OOB Estimation

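OOB estimation needs each tree to leave some observations out, so set replace = TRUE or sample.fraction < 1; the defaults (replace = FALSE, sample.fraction = 1.0) leave none. Not available in the CRAN version.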

# Bootstrap OOB
forest_oob <- DirichletRF(X, Y, num.trees = 100,
                           replace = TRUE, compute.oob = TRUE)
forest_oob$oob$mse
forest_oob$oob$predictions   # n x k matrix, NA where never OOB

# Subsampling OOB
forest_sub <- DirichletRF(X, Y, num.trees = 100,
                           replace = FALSE, sample.fraction = 0.632,
                           compute.oob = TRUE)
forest_sub$oob$mse
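The reason OOB estimation needs `replace = TRUE` or `sample.fraction < 1` can be seen by simulating the per-tree sampling step in base R. This sketch does not call the package; rounding the subsample size with `floor()` is an assumption made for illustration.

```r
set.seed(1)
n <- 200
num_trees <- 100

# TRUE where an observation is in-bag for a tree (n x num_trees logical matrix)
in_bag <- function(size, replace) {
  replicate(num_trees,
            tabulate(sample.int(n, size, replace = replace), nbins = n) > 0)
}

# Bootstrap: n draws with replacement leave roughly exp(-1) of the rows out
oob_frac_boot <- mean(!in_bag(n, replace = TRUE))

# Subsampling: 63.2% of rows without replacement leave exactly 1 - m/n out
m <- floor(0.632 * n)
oob_frac_sub <- mean(!in_bag(m, replace = FALSE))

# Full sample without replacement: nothing is ever out-of-bag
oob_frac_full <- mean(!in_bag(n, replace = FALSE))
```

With the last configuration no tree can supply an out-of-bag prediction, which is why the default sampling settings are incompatible with `compute.oob = TRUE`.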

Feature Importance

Not available in the CRAN version.

# Impurity-based: gain and split count
importance(forest)

# Permutation-based (requires compute.oob = TRUE)
permutation_importance(forest_oob, X, loss = "aitchison",
                       num.permutations = 10, seed = 42L)

Both functions return a data frame sorted by decreasing importance. permutation_importance() accepts "aitchison" (default), "mse", or "kl" as the loss function.
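For orientation, the three losses can be sketched with their common textbook definitions in base R. The package's exact definitions (for example whether the Aitchison distance is squared, or how rows are averaged) may differ, and the function names below are hypothetical. All three assume strictly positive parts.

```r
# Centred log-ratio transform, applied row-wise
clr <- function(Y) {
  L <- log(Y)
  L - rowMeans(L)
}

# Mean squared Aitchison distance between observed Y and predicted P
aitchison_loss <- function(Y, P) mean(rowSums((clr(Y) - clr(P))^2))

# Mean squared error on the raw proportions
mse_loss <- function(Y, P) mean((Y - P)^2)

# Mean Kullback-Leibler divergence KL(Y || P), row by row
kl_loss <- function(Y, P) mean(rowSums(Y * log(Y / P)))

Y <- rbind(c(0.2, 0.3, 0.5),
           c(0.6, 0.3, 0.1))
P_flat <- matrix(1 / 3, nrow = 2, ncol = 3)   # uninformative prediction

losses_flat <- c(aitchison = aitchison_loss(Y, P_flat),
                 mse       = mse_loss(Y, P_flat),
                 kl        = kl_loss(Y, P_flat))
```

The Aitchison loss depends only on ratios between parts, so rescaling a prediction (for example `2 * P_flat`) leaves it unchanged, whereas the MSE and KL losses work on the proportions themselves.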


Distributional Mode & Proximity Weights

Setting distributional = TRUE stores leaf indices, so the forest can return a weighted distribution over the training observations for any test point. Not available in the CRAN version.

forest_dist <- DirichletRF(X, Y, num.trees = 100, distributional = TRUE)

# n_test x n_train weight matrix; rows sum to 1
W <- predict_weights(forest_dist, Xtest)

# Weighted conditional mean
Y_hat <- W %*% Y

# Draw samples from the conditional distribution P(Y | x_new)
draws <- sample_conditional(forest_dist, x_new = Xtest[1, ], size = 200L)
colMeans(draws)   # estimated conditional mean
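A toy base-R example shows what a row of weights does. The weight vector `w` and the training compositions are made up, and resampling training rows in proportion to their weights is only an assumed way of drawing from the conditional distribution; how sample_conditional() works internally is not stated here.

```r
set.seed(1)

# Four hypothetical training compositions (rows sum to 1)
Y_train <- rbind(c(0.20, 0.30, 0.50),
                 c(0.60, 0.30, 0.10),
                 c(0.10, 0.80, 0.10),
                 c(0.25, 0.25, 0.50))

# Hypothetical weights for one test point; non-negative and summing to 1
w <- c(0.5, 0.0, 0.1, 0.4)

# Weighted conditional mean: a convex combination, so it is again a composition
y_hat <- drop(w %*% Y_train)

# Assumed sampling scheme: resample training rows with probability w
idx   <- sample.int(nrow(Y_train), size = 1000, replace = TRUE, prob = w)
draws <- Y_train[idx, , drop = FALSE]
mc_mean <- colMeans(draws)   # Monte Carlo estimate of y_hat
```

Because each row of the weight matrix sums to 1, `W %*% Y` always yields valid compositions without any renormalisation.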

OOB Proximity Matrix

Available when both distributional = TRUE and compute.oob = TRUE.

forest_full <- DirichletRF(X, Y, num.trees = 100,
                            distributional = TRUE,
                            replace = TRUE, compute.oob = TRUE)

W_oob <- forest_full$oob$weights   # n x n, generally asymmetric
W_sym <- (W_oob + t(W_oob)) / 2   # symmetrise if needed

⚙️ Main Parameters

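| Parameter | Default | Description |
| --- | --- | --- |
| num.trees | 100 | Number of trees |
| max.depth | 10 | Maximum tree depth |
| min.node.size | 5 | Minimum leaf size |
| mtry | -1 (= √p) | Candidate features per split |
| est.method | "mom" | Parameter estimation method: "mom" or "mle" |
| num.cores | -1 (all cores minus one) | Number of OpenMP threads |
| replace | FALSE | TRUE: bootstrap (sample with replacement); FALSE: subsample without replacement |
| sample.fraction | 1.0 | Fraction of the data drawn for each tree |
| compute.oob | FALSE | Compute OOB predictions |
| distributional | FALSE | Store leaf indices for weight-based predictions |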

Note: X must be numeric. Use one-hot encoding for categorical covariates. Rows of Y must sum to 1.
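Both requirements in the note can be met with base R. The sketch below uses a made-up data frame (`elevation`, `soil`) and made-up raw amounts; `model.matrix()` produces the one-hot columns, and dividing by row sums closes the responses to the simplex.

```r
# Hypothetical covariates: one numeric column, one categorical column
covariates <- data.frame(
  elevation = c(120, 340, 560, 80),
  soil      = factor(c("clay", "sand", "loam", "sand"))
)

# Dropping the intercept (- 1) gives one indicator column per factor level
X <- model.matrix(~ . - 1, data = covariates)

# Hypothetical raw amounts (counts, percentages, masses) for three parts
raw <- rbind(c(20, 30, 50),
             c( 6,  3,  1),
             c( 1,  8,  1),
             c( 5,  5, 10))

# Closure: rescale each row so that it sums to 1
Y <- raw / rowSums(raw)
```

The resulting `X` is a numeric matrix with columns `elevation`, `soilclay`, `soilloam`, and `soilsand`, and `Y` has rows summing to 1, so both could be passed on as the covariate and response matrices.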


📚 Reference

Masoumifard, K., van der Westhuizen, S., & Gardner-Lubbe, S. (2026). Dirichlet random forest for predicting compositional data. In A. Bekker et al. (Eds.), Environmental Modelling with Contemporary Statistics. Chapman & Hall/CRC. ISBN: 9781032903910.


📄 License

GPL-3

About

This repository contains an experimental parallel implementation of Dirichlet Random Forests for compositional data.
