Local minima are a known problem for latent-variable models. Common remedies include multiple random restarts and careful initialization (e.g., with K-means). In 2D, the problem can be sidestepped by using a mixture with many Gaussians, as I have done here. This would be a good motivation for placing a sparsity prior over the mixture weights, as in a Bayesian mixture of Gaussians, though such a model adds considerable complexity. In higher dimensions, using many Gaussians may not be a good strategy because of the curse of dimensionality (how many components are enough?), though my code supports arbitrary dimensionality and would run fine.
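As a minimal sketch of the random-restart remedy mentioned above (not the code used for these figures), the following NumPy-only EM fits a spherical Gaussian mixture from several random starts and keeps the run with the best average log-likelihood; all function names and parameters here are illustrative assumptions:

```python
import numpy as np

def gmm_loglik(X, means, var, weights):
    """Average log-likelihood of X under a spherical GMM (via log-sum-exp)."""
    d = X.shape[1]
    sq = ((X[:, None, :] - means[None]) ** 2).sum(-1)          # (n, k) squared distances
    log_p = -0.5 * (sq / var + d * np.log(2 * np.pi * var)) + np.log(weights)
    m = log_p.max(1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_p - m).sum(1))).mean())

def fit_gmm(X, k, n_iter=200, seed=0):
    """One EM run from a random start; returns (means, var, weights, avg log-lik)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, k, replace=False)].copy()  # random-start init
    var = np.full(k, X.var() + 1e-6)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities under spherical Gaussians
        sq = ((X[:, None, :] - means[None]) ** 2).sum(-1)
        log_p = -0.5 * (sq / var + d * np.log(2 * np.pi * var)) + np.log(weights)
        log_p -= log_p.max(1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(1, keepdims=True)
        # M-step: re-estimate weights, means, and per-component variances
        nk = r.sum(0) + 1e-12
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        var = (r * ((X[:, None, :] - means[None]) ** 2).sum(-1)).sum(0) / (d * nk) + 1e-6
    return means, var, weights, gmm_loglik(X, means, var, weights)

# Toy 2D data: two well-separated clusters, fit with "many" (k=8) Gaussians.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, (200, 2)), rng.normal(2, 0.5, (200, 2))])
# Multiple random starts: keep the run with the best log-likelihood.
best = max((fit_gmm(X, k=8, seed=s) for s in range(5)), key=lambda f: f[-1])
print("best avg log-lik:", round(best[-1], 2))
```

With k larger than the true number of clusters, EM typically drives several mixture weights toward zero, which is the behavior a sparsity prior would make explicit.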
Legend:
- 1st image: true (empirical) density (from data)
- 2nd image: learned (empirical) density
- 3rd image: learning curve
Legend:
- 1st image: true (unnormalized) density
- 2nd image: learned (empirical) density, with arrows showing the initial and final positions of the Gaussian means
- 3rd image: learned mixture weights
- 4th image: learning curve