diff --git a/doc/cookbook/source/examples/clustering/gmm.rst b/doc/cookbook/source/examples/clustering/gmm.rst
new file mode 100644
index 00000000000..9d82ad963ae
--- /dev/null
+++ b/doc/cookbook/source/examples/clustering/gmm.rst
@@ -0,0 +1,53 @@
+=======================
+Gaussian Mixture Models
+=======================
+
+A Gaussian mixture model is a probabilistic model that assumes the data are generated from a finite mixture of Gaussians with unknown parameters. The model likelihood can be written as:
+
+.. math::
+
+    p(x|\theta) = \sum_{i=1}^{K}{\pi_i \mathcal{N}(x|\mu_i, \Sigma_i)}
+
+where :math:`p(x|\theta)` is the probability distribution given :math:`\theta:=\{\pi_i, \mu_i, \Sigma_i\}_{i=1}^K`, :math:`K` denotes the number of mixture components, :math:`\pi_i` denotes the weight of the :math:`i`-th component, and :math:`\mathcal{N}` denotes a multivariate normal distribution with mean vector :math:`\mu_i` and covariance matrix :math:`\Sigma_i`.
+
+The expectation maximization (EM) algorithm is used to learn the parameters of the model by finding a local maximum of a lower bound on the likelihood.
+
+See Chapter 20 in :cite:`barber2012bayesian` for a detailed introduction.
+
+-------
+Example
+-------
+
+We start by creating :sgclass:`CDenseFeatures` (here 64 bit floats aka RealFeatures) as
+
+.. sgexample:: gmm.sg:create_features
+
+We initialize :sgclass:`GMM`, passing the desired number of mixture components.
+
+.. sgexample:: gmm.sg:create_gmm_instance
+
+We provide the training features to the :sgclass:`GMM` object, train it with the EM algorithm, and sample data points from the trained model.
+
+.. sgexample:: gmm.sg:train_sample
+
+We extract the parameters :math:`\pi_i`, :math:`\mu_i` and :math:`\Sigma_i` of any component from the trained model.
+
+.. sgexample:: gmm.sg:extract_params
+
+We obtain the log likelihoods of a data point belonging to each cluster and of being generated by this model.
+
+.. sgexample:: gmm.sg:cluster_output
+
+We can also train with the split-merge expectation-maximization (SMEM) algorithm :cite:`ueda2000smem`.
+
+.. sgexample:: gmm.sg:training_smem
+
+----------
+References
+----------
+:wiki:`Mixture_model`
+
+:wiki:`Expectation–maximization_algorithm`
+
+.. bibliography:: ../../references.bib
+    :filter: docname in docnames
diff --git a/doc/cookbook/source/index.rst b/doc/cookbook/source/index.rst
index af78e9663d3..23dd222b469 100644
--- a/doc/cookbook/source/index.rst
+++ b/doc/cookbook/source/index.rst
@@ -40,3 +40,12 @@ Gaussian Processes
    :glob:

    examples/gaussian_processes/**
+
+Clustering
+----------
+
+.. toctree::
+   :maxdepth: 1
+   :glob:
+
+   examples/clustering/**
diff --git a/doc/cookbook/source/references.bib b/doc/cookbook/source/references.bib
index 7dc43325c6f..4316c5cf309 100644
--- a/doc/cookbook/source/references.bib
+++ b/doc/cookbook/source/references.bib
@@ -24,3 +24,13 @@ @book{Rasmussen2005GPM
   year = {2005},
   publisher = {The MIT Press}
 }
+@article{ueda2000smem,
+  title={SMEM Algorithm for Mixture Models},
+  author={N. Ueda and R. Nakano and Z. Ghahramani and G.E. Hinton},
+  journal={Neural Computation},
+  volume={12},
+  number={9},
+  pages={2109--2128},
+  year={2000},
+  publisher={MIT Press}
+}
\ No newline at end of file
diff --git a/examples/meta/src/clustering/gmm.sg b/examples/meta/src/clustering/gmm.sg
new file mode 100644
index 00000000000..afb3d440ac4
--- /dev/null
+++ b/examples/meta/src/clustering/gmm.sg
@@ -0,0 +1,31 @@
+CSVFile f_feats_train("../../data/classifier_4class_2d_linear_features_train.dat")
+
+#![create_features]
+RealFeatures features_train(f_feats_train)
+#![create_features]
+
+#![create_gmm_instance]
+int num_components = 3
+GMM gmm(num_components)
+#![create_gmm_instance]
+
+#![train_sample]
+gmm.set_features(features_train)
+gmm.train_em()
+RealVector output = gmm.sample()
+#![train_sample]
+
+#![extract_params]
+int component_num = 1
+RealVector nth_mean = gmm.get_nth_mean(component_num)
+RealMatrix nth_cov = gmm.get_nth_cov(component_num)
+RealVector coef = gmm.get_coef()
+#![extract_params]
+
+#![cluster_output]
+RealVector log_likelihoods = gmm.cluster(nth_mean)
+#![cluster_output]
+
+#![training_smem]
+gmm.train_smem()
+#![training_smem]
\ No newline at end of file
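For anyone who wants to sanity-check the EM recipe the cookbook page describes (maximizing a lower bound on :math:`p(x|\theta)` with :math:`\theta=\{\pi_i, \mu_i, \Sigma_i\}`), here is a minimal NumPy sketch of EM for a full-covariance Gaussian mixture. This is illustrative only and is not Shogun's implementation: the function name `em_gmm`, the farthest-point initialization, and the small ridge added to the covariances are my own choices for a compact, stable demo.

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for a full-covariance Gaussian mixture model.

    Iterates E- and M-steps to climb a lower bound on
    log p(X | theta), theta = {pi_i, mu_i, Sigma_i}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Farthest-point initialization of the means (a simple, stable choice).
    mu = [X[rng.integers(n)]]
    for _ in range(K - 1):
        d2 = ((X[:, None, :] - np.asarray(mu)[None, :, :]) ** 2).sum(-1).min(1)
        mu.append(X[np.argmax(d2)])
    mu = np.asarray(mu)
    pi = np.full(K, 1.0 / K)
    sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)

    for _ in range(n_iter):
        # E-step: log of pi_k * N(x_n | mu_k, Sigma_k) per point and component.
        log_r = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            _, logdet = np.linalg.slogdet(sigma[k])
            maha = np.einsum("ij,ij->i", diff,
                             np.linalg.solve(sigma[k], diff.T).T)
            log_r[:, k] = np.log(pi[k]) - 0.5 * (d * np.log(2 * np.pi)
                                                 + logdet + maha)
        # Normalize responsibilities via log-sum-exp for numerical stability.
        m = log_r.max(axis=1)
        log_norm = m + np.log(np.exp(log_r - m[:, None]).sum(axis=1))
        r = np.exp(log_r - log_norm[:, None])
        # M-step: re-estimate weights, means, and covariances.
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] \
                       + 1e-6 * np.eye(d)
    return pi, mu, sigma, log_norm.sum()  # final total log-likelihood
```

On two well-separated clusters this recovers the component means and roughly equal weights within a handful of iterations; Shogun's `train_em()` exposes the same idea behind a single call.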