Scaling kills DPGMM [was: mixture.DPGMM not fitting to data] #2454

caofan opened this Issue Sep 18, 2013 · 6 comments



caofan commented Sep 18, 2013

I am trying out the Gaussian mixture models in the package. I tried to model a mixture of two components, G(1000, 500^2) and G(2000, 600^2). The following is the code:

import numpy as np
from sklearn import mixture

data = np.random.normal(1000, 500, 1000)
data2 = np.random.normal(2000, 600, 1000)
data = list(data) + list(data2)
model = mixture.DPGMM(n_components=10, alpha=10, n_iter=10000)
model.fit(data)
print model.means_

And I got the following means of the components.
[[ 0.13436485]
[ 0.13199086]
[ 0.11750537]
[ 0.10560644]
[ 0.12162311]
[ 0.00204134]
[ 0.12058521]
[ 0.11997703]
[ 0.11944384]
[ 0.11890694]]

It seems the model does not fit the data properly. Is this a bug, or have I got something wrong in my application of the model?


@arjoly arjoly added the Bug label May 11, 2014

@amueller amueller added this to the 0.15.1 milestone Jul 18, 2014


amueller commented Jan 28, 2015

This looks pretty bad :-/


amueller commented Jan 28, 2015

My explanation for this is: the model assumes a N(0, 1) prior on the means [and also a fixed prior on the covariance], which is not reasonable for your data. To make this work, the data should be scaled to have zero mean and unit variance. Then the result would be much more sensible.
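A minimal sketch of the prescaling idea described above. Since DPGMM was later removed from scikit-learn, this uses BayesianGaussianMixture (the rewrite mentioned at the end of this thread) as a stand-in; the point is the standardize-fit-unscale pattern, not the particular estimator:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture  # modern replacement for DPGMM

rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(1000, 500, 1000),
                    rng.normal(2000, 600, 1000)]).reshape(-1, 1)

# Standardize so a zero-mean, unit-variance prior on the means is reasonable.
mu, sigma = X.mean(), X.std()
X_scaled = (X - mu) / sigma

model = BayesianGaussianMixture(n_components=10, max_iter=1000, random_state=0)
model.fit(X_scaled)

# Map the fitted means back to the original scale of the data.
means_original = model.means_ * sigma + mu
print(means_original.ravel())
```

With this rescaling the recovered component means land in the thousands, on the scale of the data, instead of collapsing toward zero as in the original report.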

I have too little experience with these kinds of models to say what a good solution would be.
Possible candidates:

  • prescale the data (and adjust precision and mean that are estimated accordingly)
  • raise a warning?
  • use a hierarchical Bayesian approach?
  • make the priors parameters of the estimator
  • estimate the priors from the data (which is probably the same as just rescaling the data)
  • use a much wider (or non-informative) prior on the means

PS: any Bayesian should feel free to hit me and implement the hierarchical approach.

@amueller amueller changed the title from mixture.DPGMM not fitting to data to Scaling kills DPGMM [was: mixture.DPGMM not fitting to data] Jan 28, 2015


amueller commented Jan 28, 2015

Thinking about it, I'd have expected 1000 samples to be enough to overcome the prior... hmm...


amueller commented Jan 29, 2015

The derivation of the mean http://scikit-learn.org/dev/modules/dp-derivation.html#the-updates is quite different from the one listed in Bishop's or Murphy's book. In particular, in the books the variational mean parameters don't depend on the variational precision parameters, which they do in the derivation in the docs (which is odd).
I'm a bit tempted to replace the implementation by a close correspondence to Bishop and see how that goes.
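For reference, in Bishop's variational treatment (PRML, Section 10.2) the update for the mean parameter of the Gaussian-Wishart factor is, with β₀ and m₀ the prior parameters, N_k the effective responsibility count, and x̄_k the responsibility-weighted sample mean:

```latex
\beta_k = \beta_0 + N_k, \qquad
m_k = \frac{1}{\beta_k}\left(\beta_0 m_0 + N_k \bar{\mathbf{x}}_k\right)
```

Note that this update involves only β₀, m₀, and the responsibilities, not the variational parameters of the precision (Wishart) factor, which is the discrepancy with the docs derivation pointed out above.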


GaelVaroquaux commented Jan 29, 2015

> I'm a bit tempted to replace the implementation by a close correspondence to Bishop and see how that goes.

I am not very attached to our implementation. It has given us a lot of
problems in the past.

@amueller amueller modified the milestone: 0.16, 0.17 Sep 11, 2015

@amueller amueller modified the milestone: 0.18, 0.17 Sep 20, 2015


ogrisel commented Sep 10, 2016

Closing: the new Dirichlet process GMM re-write has been merged in master. It is not affected by this bug.

@ogrisel ogrisel closed this Sep 10, 2016
