Skip to content

C++ implementation of Dirichlet Process Bayesian Clustering

Notifications You must be signed in to change notification settings

ucyixl/DiPBaCpp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

########## DiPBaC++: C++ and R code version 1.3.2 ##################

## Licence ###
(C) Copyright David Hastie and Silvia Liverani, 2012.

DiPBaC++ is free software; you can redistribute it and/or modify it under the 
terms of the GNU Lesser General Public License as published by the Free Software
Foundation; either version 3 of the License, or (at your option) any later 
version.

DiPBaC++ is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A 
PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License
along with DiPBaC++ in the documentation directory. If not, see 
<http://www.gnu.org/licenses/>.

The external linear algebra library Eigen, parts of which are included  in the 
lib directory is released under the LGPL3+ licence. See comments in file headers
for details.

The Boost C++ header library, parts of which are included in the  lib directory 
is released under the Boost Software Licence, Version 1.0, a copy  of which is 
included in the documentation directory.


### Description ###
Program to implement Dirichlet Process Bayesian Clustering as described in
Hastie et al. 2011. Previously this project was called profile regression.

   - Implements an infinite Dirichlet process model
   - Can do dependent or independent slice sampling (Kalli et al., 2011) 
     or truncated Dirichlet process model (Ishwaran and James, 2001)
   - Handles categorical or Normal covariates
   - Handles Bernoulli, Binomial, Categorical, Poisson or Normal responses
   - Handles inclusion of fixed effects in the response model
   - Handles Extra Variation in the response (for Bernoulli, Binomial and Poisson response only)
   - Handles variable selection (tested in Discrete covariate case only)
   - Includes label switching moves for better mixing
   - Allows user to exclude the response from the model
   - Allows user to compute the entropy of the allocation
   - Allows user to run with a fixed alpha or update alpha (default)
   - Allows users to run predictive scenarios (at C++ run time)
   - Basic or Rao-Blackwellised predictions can be produced
   - Handling of missing data
   - C++ for model fitting
   - Uses Eigen Linear Algebra Library and Boost C++
   - Completely self contained (all library code in included in distribution)
   - Adaptive MCMC where appropriate
   - R package for generating simulation data and post processing
   - R plotting functions allow user choice of what to order clusters by


### ChangeLog: ###
Changes from 1.3.1 to 1.3.2
   - Fixed a bug in the R code causing risks not to plot in order 
   - Removed small printing statements left in from debugging
Changes from 1.3.0 to 1.3.1
   - Fixed a numerical bug causing an infinite loop and also changed 
     predictive subjects so they could only be allocated to cluster containing
     at least one fitting subject

Changes from 1.2.0 to 1.3.0
   - Added in different sampler types. Single piece of software can be run as
     dependent or independent slice samplers (Kalli et al., 2011) or truncated
     Dirichlet process (Ishwaran and James, 2001).
   - Added in possibility to return the entropy of allocation.      

Changes from 1.1.0 to 1.2.0
	- Added categorical response (extra response variation not permitted in this 
	  case). Categorical response must be coded as integers from 0.
   - Changed the Linear Algebra library from Armadillo to Eigen

Changes from 1.0.1 to 1.1.0
   - Added normal response (extra response variation not permitted in this case)
   - Removed separate handling of ordinal models. NB: IMPORTANT- This means that
    input files generated for Discrete covariate models for previous versions
    are no longer compatible with this version, as the line indicating whether
    each covariate or ordinal or not must no longer be included. This has no
    impact for normal covariates
   - Restored random (but parsimonious) initial number of clusters
   - Changed prior of phi to be Dirichlet(0.5,...,0.5) by default instead of
     uniform.
   - Changed prior for alpha to be Gamma(2,1)
   - Fixed bug with computation of full log posterior introduced in version
     1.0.1. This did not affect running of sampler as this is only used for
     reporting purposes.
   - Renamed confounders to fixed effects throughout code
   - Removed the function that plotted riskProfileRank because scaling was
     misleading.
   - Removed some of the cluster comparison functions from the end of the post
     processing R files as these were little used and not really maintained.
   - Tidied R files for generating simulation data
   - Added in missing example files referred to in this README and added others

Changes from 1.0.0 to 1.0.1
   - Added calculation of log odds ratio into R computation of predictions
   - Hot fixed a bug with the R package where the C++ being called for
     computing the dissimilarity matrix was calling the wrong package.
      
1.0.0 was the first major version of this software, but the software
was previously available from the author in the profileRegression archive
Changes since profileRegression version 2.0.1
   - Changes to this README, in particular noting that it was --fixedAlloc not
     --fixedInit run time option that was removed in version 2.0.0, and
     correction of arguments to the discussion of R calcPredictions function.
   - Changes to R package to correct warnings from calcOptimalClustering not
     closing files, if maxNClusters argument was not null, and to remove unused
     argument to R calcPredictions function.
   - Fixed bugs in the C++ for implementation of variable selection for Normal
     covariates
   
### Known bugs and issues: ###
   - Variable selection for Normal covariates is only partially tested. C++ seems
     to behave well, but have not looked at post processing.


### Dependencies: ###
The following are packages and libraries are required
   - cmake

### Installation: ###
To install the program you need the free "cmake" build utility.

In the main directory type
>cmake .

Then
>make

This will create the executable in the bin subdirectory.


### Input Data: ###
Before the program can be run the data must be formatted in the expected way.
The input file is made up of the
   -No of subjects
   -No of covariates
   -Covariate names (1 per line)
   -No of fixed effects
   -Fixed effect names (1 per line, only if the no. of fixed effects > 0)
   -Number of categories per covariate (only if the covariates are discrete,
    i.e. categorical)
   -Data

Each line of the data corresponds to a separate subject and for each subject
should be in the order
y x1 x2 ... xJ w1 w2 ... wM T
where y is the outcome variable
      x are the covariates
      w are the fixed effects
      T (Poisson and Binomial models only).
       In the Poisson model T is the offset (must be 1 if no offset)
y~Poisson(mu),
log(mu) = theta + beta%*%W + log(T)
In the Binomial model, T is the total number of trials.
T should not be present for other response models.

Missing covariates should be denoted with the value -999. There is currently no
handling of missing fixed effects or missing offset. Please note, that even if
--excludeY is passed, the program expects a response in each row (and also the
offset or total number of trials if the yModel is Poisson or Binomial). In other
words, please prepare the input file as if you were going to be including Y in
the fit. It is fine (and proper if the yModel is Poisson or Binomial) to pass
both --yModel=... and --excludeY arguments as the --yModel helps determine how
the data is read, but will be ignored in terms of fitting if the --excludeY
argument is present.

An example input file for categorical outcome and categorical data is given in
data/input/example_Categorical_Discrete_input.txt
An example input file for binary outcome and categorical data is given in
data/input/example_Bernoulli_Discrete_input.txt
An example input file for Poisson outcome and categorical data is given in
data/input/example_Poisson_Discrete_input.txt
An example input file for Poisson outcome and Normal data is given in
data/input/example_Poisson_Normal_input.txt
An example input file for Binomial outcome and Normal data is given in
data/input/example_Binomial_Normal_input.txt
An example input file for Normal outcome and Normal data is given in
data/input/example_Normal_Normal_input.txt
An example input file for variable selection with Bernoulli outcome and
categorical data is given in
data/input/example_Var_Select_Bernoulli_Discrete_input.txt

These example datafiles are generated with the datasets in generateData.R in the
associated R package.

We recommend you store real data outside the data directory in this package to
avoid it being overwritten when the package is updated.

### Hyperparameters: ###

Hyperparameters for the priors can be specified in an input file which is passed
at command line using the --hyper option. Each row in this file should
correspond to a different hyper parameter and should be in the format

parameter1=value1
parameter2=value2

Where the parameter is a vector or matrix, the elements should be all on the
same line separated by spaces. The user can specify some or all hyperparameters.
Those hyperparameters not specified will take their default values. Where the
file is not provided, all hyperparameters will take their default values.

An example parameter file is provided in
data/input/example_hyperparameter_file.txt

The possible hyperparameters are (with definition):
shapeAlpha
- The shape parameter for Gamma prior on alpha (default=1.0)
rateAlpha
- The inverse-scale (rate) parameter for the Gamma prior on alpha (default=0.5)
useReciprocalNCatsPhi 
- Boolean denoting whether the vector phi_j (for covariate j) have all elements 
  equal (only used in the discrete covariate case, default=true)
aPhi 
- The vector of parameters for the Dirichlet prior on phi_j. Element j 
  corresponds to covariate j which then has a prior 
  Dirichlet(aPhi[j],aPhi[j],....,aPhi[j]). (Only used in discrete case if 
  useReciprocalNCatsPhi is false, default=(1 1 1 ... 1))
mu0
- The mean vector for mu_c in the Normal covariate case (only used in Normal 
covariate case, default=empirical covariate means)
Tau0
- The precision matrix for mu_c in the Normal covariate case (only used in 
  Normal covariate case, default=inverse of diagonal matrix with elements equal 
  to squareof empirical range for each covariate)
R0
- The matrix parameter for the Wishart distribution for Tau_c (only used in 
  Normal covariate case, default=1/nCovariates * inverse of empirical covariance 
  matrix)
kapp0 
- The degrees of freedom parameter for the Wishart distribution for Tau_c (only 
  used in Normal covariate case, default=nCovariates).
muTheta 
- The location parameter for the t-Distribution for theta_c (only used if 
  response included in model, default=0)
sigmaTheta
- The scale parameter for the t-Distribution for theta_c (only used if response 
  included in model, default=2.5)
dofTheta 
- The degrees of freedom parameter for the t-Distribution for theta_c (only used 
  if response included in model, default=7) 
muBeta 
- The location parameter for the t-Distribution for beta (only used when fixed 
  effects present, default=0)
sigmaBeta 
- The scale parameter for the t-Distribution for beta (only used when fixed 
effects present, default=2.5)
dofBeta 
- The dof parameter for the t-Distribution for beta (only used when fixed 
  effects present, default=7)
shapeTauEpsilon 
- Shape parameter for gamma distribution for prior for precision tau of extra 
  variation errors epsilon (only used if extra variation is used i.e. 
  --extraYVar flag is included, default=5.0)
rateTauEpsilon 
- Inverse-scale (rate) parameter for gamma distribution for prior for precision 
  tau of extra variation errors epsilon (only used if extra variation is used 
  i.e. --extraYVar flag is used, default=0.5)
aRho 
- Parameter for beta distribution for prior on rho in variable selection 
  (default=0.5)
bRho 
- Parameter for beta distribution for prior on rho in variable selection 
  (default=0.5)
shapeSigmaSqY 
- Shape parameter of inverse-gamma prior for sigma_Y^2 (only used in the Normal 
  response model, default =2.5)
scaleSigmaSqY 
- Scale parameter of inverse-gamma prior for sigma_Y^2 (only used in the Normal 
  response model, default =2.5)
rSlice 
- Slice parameter for independent slice sampler such that 
  xi_c = (1-rSlice)*rSlice^c for c=0,1,2,... (only used for slice independent
  sampler i.e. --sampler=SliceIndependent, default 0.75). 
truncationEps
- Parameter for determining the truncation level of the finite Dirichlet process
  (only used for truncated sampler i.e. --sampler=Truncated
 

### Predictions (at C++ run time) ###

The algorithm can now take an input file of predictive scenarios. The first line
in this file must contain the number of predictive subjects, and then subsequent
lines must have the covariate values for each of these subjects. An example file
is in example_Poisson_Normal_predictX.txt in the data/input folder.

At each iteration the predictive subjects are assigned to one of the current
clusters according to their covariate profiles (but ignoring missing values), or
their Rao Blackwellised estimate of theta is recorded (a weighted average of all
theta, weighted by the probability of allocation into each cluster.

The predictive subjects have no impact on the likelihood and so do not determine
the clustering or parameters at each iteration. The predictive allocations are
then recorded as extra entries in each row of the output_z.txt file. This can
then be processed in the R post processing to create a dissimilarity matrix with
the fitting subjects. The R post procesing function calcPredictions will create
predicted response values for these subjects.

### Running the program: ###

Runnning
>bin/DiPBaCpp --help
will give a summary of all the possible user run time options

An example call of the C++ program (from the main folder)
>bin/DiPBaCpp --input=data/input/example_Poisson_Normal_input.txt
 --output=data/output/output_example --nSweeps=10000 --nBurn=1000
 --yModel=Poisson --xModel=Normal --extraYVar
 --hyper=data/input/example_hyperparameter_file.txt
 --predict=data/input/example_Poisson_Normal_predictX.txt


### Output and post processing: ###

Once the C++ has completed the output from fitting the regression is stored in a
number of text files in the directory specified (in the above call the directory
is data/output and the file stem is output_example). Files are produced
containing the MCMC traces for all of the values of interest, along with a log
file and files for monitoring the acceptance rates of the adaptive Metropolis
Hastings moves.

To produce a graphical summary of the output there is the associated
R package in the rfiles subdirectory.

The package depends on the Rcpp, clue, cluster and ggplot2 packages (available
from CRAN)

To install the package from the rfiles subdirectory use
>R CMD INSTALL DiPBaC_1.3.1.tar.gz

Then in R the commands are:
>require(DiPBaC)
>runInfoObj<-readRunInfo('../data/output','output_example')
>dissimObj<-calcDissimilarityMatrix(runInfoObj)
>clusObj<-calcOptimalClustering(dissimObj)
>riskProfileObj<-calcAvgRiskAndProfile(clusObj)
>clusterOrderObj<-plotRiskProfile(riskProfileObj,'../data/output/summary.png')

The last two R functions take a little time.

The summaries produced are different from those reported in the previous papers.
In particular, all clusters are visually displayed together (this has worked
well for my problems with the number of clusters being returned, but may need
tweaking to split over a number of pages depending on the problem).

For discrete covariates, instead of plotting the probability that a phi is above
or below the mean value, we plot the actual phi values (and plot the mean value
across clusters as a horizontal line).

For normal covariates, for each covariate the upper plot is the posterior
distribution for the mean mu, and the lower plot is the posterior distribution
of sqrt(Sigma[j,j]) (i.e. the standard deviation for that covariate).

Additionally there is a function for visually plotting the cluster
>plotClustering(clusObj,'../data/output/cluster.png')
This function produces a visual representation of how the subjects cluster when
plotted against the two principal components. This function also requires a raw 
X data file, which contains (if available) the uncategorised covariate data, in 
the format 
  -No of subjects
  -No of covariates
  -Covariate names (1 per line)
  -Data

Each line of the data corresponds to a separate subject and for each subject
should be in the order
x1 x2 ... xJ

There is also now an additional function calcPredictions for computing predicted
responses, for various prediction scenarios. It is assumed that the predictive
allocations and Rao-Blackwell predictions have already been done in C++ (using
the --predict=<filename> run time option). The user can provide the function
with a file through the predictResponseFileName argument. This file has the
number of subjects, followed by a row for each subject, where each row contains
values for the response, fixed effects and offset / number of trials (depending
on the response model) where available. Missing values in this file are denoted
(-999). If the file is not provided then the response, fixed effect and offset
data is treated as missing for all subjects. If a subject is missing fixed
effect values, then the mean value or 0 category fixed effect is used in the
predictions (i.e. no fixed effect contribution to predicted response). If the
offset / number of trials is missing this value is taken to be 1 when making
predictions. If the response is provided for all subjects, the predicted
responses are compared with the observed responses and the bias and rmse are
computed.

The function can produce predicted values based on simple allocations
(the default), or a Rao-Blackwellised estimate of predictions, where
the probabilities of allocations are used instead of actually performing a
random allocation.

An example file where the fixed effects can be provided for prediction but the
observed response is missing is data/input/example_Poisson_Normal_predictW.txt
(there are 2 fixed effects in this example). An example of using the function,
which would do the Rao Blackwellised predictions, is given by

>calcPredictions(riskProfileObj,
  predictResponseFileName='../data/input/example_Poisson_Normal_predictW.txt',
  doRaoBlackwell=T)

An example file where both the observed response and fixed effects are present
is in data/input/example_Poisson_Normal_predictYW.txt (there are no fixed
effects in this example, but these would just be added as columns between the
first and last columns). An example of using the function, which would do the
simple predictions using the allocations produced by the C++, is given by

>calcPredictions(riskProfileObj,
  predictResponseFileName='../data/input/example_Poisson_Normal_predictYW.txt',
  doRaoBlackwell=F)

In order to support variable selection runs, the plotRiskProfile and
plotRiskProfileRank functions have an argument useProfileStar (by default false)
which, if true, will use the composite phi (the mixture between phi and the null
phi) or composite mu (the mixture between mu and null mu) in the plots. It is
also possible to use the whichCovariates argument to restrict which covariates
are plotted. There is also the function summariseVarSelectRho for summarising
the continuous switches in variable selection runs.

There is additionally some extra functions for computing the divergence between
two separate profile regressions runs. This work is joint work with Georgios
Papageorgiou and is work in progress. We are currently performing simulations to
understand how this measure works and will provide further notes in later
releases. 

About

C++ implementation of Dirichlet Process Bayesian Clustering

Resources

Stars

Watchers

Forks

Packages

No packages published