Bug: GMM ``score()`` returns an array, not a value. #2473

Closed
jakevdp opened this Issue Sep 22, 2013 · 19 comments


@jakevdp
Member
jakevdp commented Sep 22, 2013

The GMM.score() function returns an array rather than a single value. This is inconsistent with the rest of scikit-learn: for example, both sklearn.base.ClassifierMixin and sklearn.base.RegressorMixin implement a score() function that returns a single number, as do KMeans, KernelDensity, PCA, GaussianHMM, and others.

Currently, GMM.score() returns an array of the individual scores for each sample: this method should probably be called GMM.score_samples(), and GMM.score() should return sum(GMM.score_samples()).

Note that in the last release, we renamed GMM.eval() to GMM.score_samples(). I believe this was a mistake: the score_samples name has a very general meaning (e.g. it is used within KernelDensity), while GMM.eval() returns a tuple containing the per-cluster likelihoods, which makes sense only for GMM.
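For reference, the renamed method keeps the old eval() behavior and returns a tuple of two arrays rather than one score per sample; roughly (a sketch of the shapes involved, using the 0.14-era API):

import numpy as np
from sklearn.mixture import GMM

X = np.random.randn(100, 2)
gmm = GMM(n_components=3).fit(X)

# current behavior (formerly GMM.eval): a tuple, not one score per sample
logprob, responsibilities = gmm.score_samples(X)
# logprob.shape == (n_samples,), responsibilities.shape == (n_samples, n_components)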

If this change were made so that GMM.score() returned a single number, then the following recipe would work to optimize a GMM model (as it does for, e.g., KernelDensity). As it is, this recipe fails for GMM:

import numpy as np
from sklearn.mixture import GMM
from sklearn.datasets import make_blobs
from sklearn.grid_search import GridSearchCV

X, y = make_blobs(100, 2, centers=3)

# use grid search cross-validation to optimize the gmm model
params = {'n_components': range(1, 5)}
grid = GridSearchCV(GMM(), params)
grid.fit(X)

print(grid.best_estimator_.n_components)

The result:

ValueError: scoring must return a number, got <type 'numpy.ndarray'> instead.
@jakevdp referenced this issue Sep 22, 2013 in [MRG] Generative Classification #2468 (open).
@jakevdp
Member
jakevdp commented Sep 24, 2013

Because of the API confusion this change would generate, one potential option is to rename GMM to GaussianMixture and implement the conformant API within the new class. This has the added bonus that GaussianMixture is more consistent with the naming conventions throughout the rest of the package. GMM could be deprecated but kept around for a few releases so as not to break people's code. As long as the deprecation message notes the new API, it should be a manageable transition. Thoughts?
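For illustration, the deprecation could follow the usual warnings-based pattern, roughly like this sketch (the constructor signature and warning text are only illustrative):

import warnings

class GaussianMixture(object):
    """New class implementing the conformant score()/score_samples() API."""
    def __init__(self, n_components=1):
        self.n_components = n_components

class GMM(GaussianMixture):
    """Deprecated alias, kept for a few releases so existing code keeps working."""
    def __init__(self, n_components=1):
        warnings.warn("GMM is deprecated; use GaussianMixture, whose score() "
                      "returns a single number.", DeprecationWarning)
        super(GMM, self).__init__(n_components=n_components)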

@kaushik94
Contributor

#2911 should solve this issue.

@larsmans
Member
larsmans commented Mar 8, 2014

@jakevdp +1 for the renaming, but then we'd want to change score_samples at the same time, right?

@larsmans
Member
larsmans commented Mar 9, 2014

Maybe we should also fix the inconsistency reported in #2847 at the same time.

@jakevdp
Member
jakevdp commented Mar 9, 2014

+1 for the renaming, but then we'd want to change score_samples at the same time, right?

Yes. What I think should really happen is for us to decide on and document a density estimator API, which would be applicable to KernelDensity, GaussianMixture, and perhaps other methods in the future. It might even involve a DensityMixin class similar to ClassifierMixin and RegressorMixin. I think the KDE interface is pretty clean, except for the fact that evaluating the density requires calling score_samples, which is very non-intuitive if you're a user trying to figure out how to compute a density estimate (a more reasonably named alias for that method would help immensely, IMO).

My thought is that the GMM interface should conform to that, which mostly amounts to changing the behavior of score and score_samples.
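In code, the contract being proposed reads roughly like this (a sketch using KernelDensity, which already behaves this way):

import numpy as np
from sklearn.neighbors import KernelDensity

X = np.random.randn(100, 2)
est = KernelDensity().fit(X)

log_dens = est.score_samples(X)  # shape (100,): per-sample log-likelihood
total = est.score(X)             # a single float, equal to log_dens.sum()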

@amueller amueller added this to the 0.15.1 milestone Jul 18, 2014
@amueller
Member
amueller commented Jan 9, 2015

And what should we do with VBGMM and DPGMM?
What is the difference between score_samples and predict_proba? predict_proba returns per-component likelihoods, right?

You are saying that score_samples should provide one score per sample, right? I agree. It should be the likelihood of each point under the model.
And we need a method that gives cluster responsibilities. I thought that was predict_proba. Is it not?

Are there any other changes to the API that we need?

@agramfort
Member

sounds good to me

@jakevdp
Member
jakevdp commented Jan 10, 2015

You are saying that score_samples should provide one score per sample, right? I agree. It should be the likelihood of each point under the model.

Yes.

Are there any other changes to the API that we need?

The score method should probably return a single number, as it does across the rest of the package. This change would allow GMM to be used automatically with cross-validation. Currently, score returns a score per sample, which is inconsistent with the rest of the package.

For example, this works with KernelDensity:

import numpy as np
X = np.random.randn(100, 2)

from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KernelDensity
grid = GridSearchCV(KernelDensity(), {'bandwidth': [0.05, 0.1, 0.2]})
grid.fit(X)

But the equivalent does not currently work with GMM: it leads to an error because gmm.score returns an array:

from sklearn.mixture import GMM
grid = GridSearchCV(GMM(), {'n_components': [1, 2, 3]})
grid.fit(X) # <-- fails

And we need a method that gives cluster responsibilities.

True. I think you're right that predict_proba returns this already, and this is consistent with the way supervised classifiers return this information.
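For instance, the distinction shows up in the shapes (a sketch; the score_samples/score lines describe the proposed API rather than the current tuple-returning one):

import numpy as np
from sklearn.mixture import GMM

X = np.random.randn(100, 2)
gmm = GMM(n_components=3).fit(X)

resp = gmm.predict_proba(X)  # shape (100, 3): per-component responsibilities
assert np.allclose(resp.sum(axis=1), 1)  # each row sums to one
# under the proposed API:
#   gmm.score_samples(X)     # shape (100,): per-sample log-likelihood
#   gmm.score(X)             # a single float: the total log-likelihood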

@larsmans
Member

Perhaps we should introduce new classes for this purpose, for minimal disturbance, then deprecate the old ones?

@jakevdp
Member
jakevdp commented Jan 12, 2015

Perhaps we should introduce new classes for this purpose, for minimal disturbance, then deprecate the old ones?

+1. We've long had a policy to avoid acronyms like GMM, but GMM was one of the earlier methods added before there was a lot of thought about that. If the method were being added now, it would probably be called GaussianMixture, so we could move to that when making these changes.

@jakevdp
Member
jakevdp commented Jan 12, 2015

I'd also say that to make sure the density estimator interface is well-defined and followed, we could add a DensityEstimatorMixin that defines methods such as density and score in terms of score_samples, and perhaps others.
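A minimal sketch of what such a mixin could look like, assuming score_samples returns per-sample log-likelihoods (the density method and its name are only a proposal at this point):

import numpy as np

class DensityEstimatorMixin(object):
    """Define score and density in terms of score_samples."""

    def score(self, X, y=None):
        # total log-likelihood of X under the model, as a single number
        return np.sum(self.score_samples(X))

    def density(self, X):
        # per-sample likelihood: the non-log version of score_samples
        return np.exp(self.score_samples(X))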

@amueller
Member

so score_samples would be a log-space version of density?

@amueller
Member

Also, again, if we rename GMMs we should probably also rename DPGMM and VBGMM

@jakevdp
Member
jakevdp commented Jan 13, 2015

so score_samples would be a log-space version of density?

That's what I'm thinking. In most cases I believe the sample score will essentially be the log-likelihood under the model. Do you know of any examples where this assumption might not hold?

@amueller
Member

I can't think of any.
I'm a bit hesitant to add public API that is just a np.log call away, though I do agree that often the non-log representation is more natural.

@jakevdp
Member
jakevdp commented Jan 13, 2015

My desire to add that API is mainly to make using KernelDensity, GaussianMixture, and other density estimators more intuitive. See related #4062.

@xuewei4d added four commits to xuewei4d/scikit-learn that referenced this issue Jun 2, 2015, all titled "Fix #2473. Add ``DensityMixin``. Change API of GMM, ``score_samples``, ``score``": 419e692, f17807c, 2c24872, 97480eb
@xuewei4d referenced this issue Jun 2, 2015 in [RFC] GSoC2015 Improve GMM API #4802 (closed).
@ogrisel
Member
ogrisel commented Jun 17, 2015

Re-reading this discussion I would like to make sure I understand this comment:

I'm a bit hesitant to add public API that is just a np.log call away, though I do agree that often the non-log representation is more natural.

I think the default score(X_test) should return the sum of the log-likelihoods of the samples in X_test, which are negative. If the number of samples in X_test is small, np.exp(gmm.score(X_test)) will be strictly positive but typically very close to 0.0. As soon as X_test has many samples, the likelihood will collapse to zero because of the limited precision of floating-point numbers.

So score should always be in log-likelihood space, because the joint likelihood of several samples (assumed i.i.d.) is very often too close to zero to be representable as a floating-point number.

We could add a density method that returns the per-sample likelihoods directly (without the log).
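A quick numerical illustration of the underflow argument, with made-up per-sample log-likelihoods:

import numpy as np

# 1000 i.i.d. test samples, each with log-likelihood around -5
log_liks = -5.0 * np.ones(1000)

print(np.sum(log_liks))          # -5000.0: fine in log space, usable for model selection
print(np.exp(np.sum(log_liks)))  # 0.0: the joint likelihood underflows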

@amueller amueller modified the milestone: 0.16, 0.17 Sep 11, 2015
@amueller amueller modified the milestone: 0.18, 0.17 Sep 20, 2015
@tguillemot
Contributor
tguillemot commented Sep 2, 2016

Solved by #6666

@ogrisel
Member
ogrisel commented Sep 10, 2016

Closing: the new Dirichlet process GMM re-write has been merged in master. Its score method is consistent with the rest of the library.

@ogrisel ogrisel closed this Sep 10, 2016