
Improving the performance of multivariate normal models in dirichletprocess

airde edited this page Apr 18, 2022 · 7 revisions

Background

dirichletprocess is an R package for fitting nonparametric Bayesian models. It is written entirely in R and is designed to be easy to adapt and modify when building nonparametric models. You can drop dirichletprocess model objects into your own model without having to worry about the underlying inference routines or algorithms.

Multivariate normal sampling in the dirichletprocess package is currently slow and scales poorly to high dimensions or large data sets. This limits how the package can be used and restricts its overall adoption. Improving the performance of multivariate normal models would significantly increase the number of applications and the scale of the tasks it can accomplish. This will help people use Dirichlet process models without having to implement their own sampling schemes, so they can focus on the modelling work they are most interested in.

Dean Markwick wrote this package as part of his PhD and it has also received multiple contributions from different authors over the years. It has received some notable citations from R Koenker and Chris Holmes.

Related work

Typical Dirichlet process packages require you to use their specified models and lack the flexibility to construct custom models. They also use C++, which can make them tricky to build upon and requires more background knowledge.

The mclust package implements the EM algorithm for Gaussian mixtures with covariance matrices under several kinds of constraints, including diagonal ones (but only classical model selection, no Dirichlet process prior). M-step update rules for the constrained models are given in https://hal.inria.fr/inria-00074643

Details of your coding project

  1. Try changing the imported package from mvtnorm to mc2d (function rmultinormal provides random normal generation vectorized on mean/covariance parameters, so it may be quicker). This requires making sure no functionality is broken after making the change and also benchmarking the difference to ensure there is an increase in performance (decrease in computation time and/or memory usage). Detailed benchmarking could also highlight other areas of potential improvement.

The end result will provide:

  a. Robust benchmarking scripts.
  b. New methods that replace the mvtnorm package with the mc2d package.
  c. Sufficient testing and checking that nothing has broken in the changeover.
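A benchmarking script for this comparison could start from something like the sketch below. It is only an illustration: the helper `time_sampler`, the dimension, the sample size and the covariance matrix are all made up for this example, and it assumes that `mc2d::rmultinormal` accepts the covariance matrix flattened to a vector of length d^2 (as in the mc2d documentation examples). Each sampler is skipped if its package is not installed.

```r
# Sketch: time repeated multivariate normal draws with mvtnorm and mc2d.
# All sizes and parameters are illustrative assumptions, not project settings.

time_sampler <- function(sampler, reps = 20) {
  # Median elapsed seconds over `reps` calls to a zero-argument sampler.
  times <- vapply(
    seq_len(reps),
    function(i) system.time(sampler())[["elapsed"]],
    numeric(1)
  )
  median(times)
}

d <- 5                    # dimension (illustrative)
n <- 10000                # draws per call (illustrative)
mu <- rep(0, d)
sigma <- diag(d) + 0.5    # symmetric positive definite covariance matrix

samplers <- list()
if (requireNamespace("mvtnorm", quietly = TRUE)) {
  samplers$mvtnorm <- function() {
    mvtnorm::rmvnorm(n, mean = mu, sigma = sigma)
  }
}
if (requireNamespace("mc2d", quietly = TRUE)) {
  # Assumption: sigma is passed as a vector of length d^2.
  samplers$mc2d <- function() {
    mc2d::rmultinormal(n, mean = mu, sigma = as.vector(sigma))
  }
}

# Named vector of median timings, one entry per available package.
timings <- vapply(samplers, time_sampler, numeric(1))
print(timings)
```

This only times draws that share a single mean and covariance. Since the possible advantage of rmultinormal is that it is vectorized over parameters, a fuller benchmark would also compare a loop over per-cluster parameters against a single vectorized call, and would vary the dimension and the number of clusters.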

  2. Using a constrained covariance matrix. This introduces a new class of models for the multivariate normal distribution that allows for a constrained covariance matrix. This will reduce the number of free parameters and speed up model fitting.

The end result will provide:

  a. New class of mixture models.
  b. Tests to ensure the functionality is correct.
  c. Documentation and extension to the vignette detailing the new models.
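To make the reduction in free parameters concrete, the counts for the covariance matrix of a d-dimensional normal distribution under three common structures are shown below. The diagonal and spherical cases are illustrative examples of constraints, not necessarily the ones this project will implement.

```latex
\begin{aligned}
\text{full: } & \Sigma \text{ symmetric positive definite}, & \tfrac{d(d+1)}{2} \text{ free parameters} \\
\text{diagonal: } & \Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_d^2), & d \text{ free parameters} \\
\text{spherical: } & \Sigma = \sigma^2 I_d, & 1 \text{ free parameter}
\end{aligned}
```

For d = 10 this gives 55, 10 and 1 covariance parameters per cluster respectively.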

Expected impact

Faster model fitting will help improve the reach of the package and make it a viable option for larger-scale problems. Better performance also helps the environment by reducing the number of CPU cycles needed.

Mentors

Tests

Contributors, please do one or more of the following tests before contacting the mentors above.

Easy: Download the package and fit the normal mixture model to the faithful data set. Fit the multivariate normal model to the iris or palmerpenguins data set. Plot the resulting distribution for both models.

Medium: Generate some random data from a lognormal mixture model. Fit a Dirichlet process model to this simulated data. Sample new data from the posterior of the final model and summarise the 5%/95% quantiles of the simulated data. Explore how the prior distribution on the alpha parameter affects the number of clusters. Plot the alpha parameter chains after using different prior distributions to assess how long the model takes to converge.
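As a possible starting point for the data-generation step only, a lognormal mixture can be simulated in base R along the following lines. The function name `simulate_lognormal_mixture` and every weight and parameter value here are hypothetical choices for illustration; the model fitting and posterior analysis are left to the contributor.

```r
# Sketch: simulate n observations from a K-component lognormal mixture.
# All weights and component parameters are made-up illustrative values.
simulate_lognormal_mixture <- function(n, weights, meanlog, sdlog) {
  stopifnot(
    length(weights) == length(meanlog),
    length(weights) == length(sdlog),
    abs(sum(weights) - 1) < 1e-8
  )
  # Pick a component for each observation, then draw from that component.
  component <- sample(seq_along(weights), size = n, replace = TRUE,
                      prob = weights)
  y <- rlnorm(n, meanlog = meanlog[component], sdlog = sdlog[component])
  list(y = y, component = component)
}

set.seed(1)
sim <- simulate_lognormal_mixture(
  n = 500,
  weights = c(0.3, 0.7),
  meanlog = c(0, 2),
  sdlog = c(0.25, 0.5)
)

# Quick checks on the simulated data.
print(table(sim$component))
print(summary(sim$y))
```

Keeping the true component labels alongside the data makes it easier to compare the number of clusters the fitted model finds with the number used in the simulation.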

Hard: Write the MixingDistribution objects and methods to implement your own custom mixture model. Fit this new mixture model to some new data (real or simulated) and plot the resulting posterior distribution. Write out the necessary equations underlying the mixture distribution to be included in the vignette.

Solutions of tests

Contributors, please post a link to your test results here.