Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LinearDiscriminantAnalysis predict probability bug #6848

Closed
agamemnonc opened this Issue Jun 1, 2016 · 10 comments

Comments

Projects
7 participants
@agamemnonc
Copy link
Contributor

agamemnonc commented Jun 1, 2016

I am pretty confident there is a bug introduced in commit
7c1101d

Concretely, line 518 of the current version
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py
should be removed as it yields wrong results.

There is no reason why constant 1 should be added to the computed probability after exponentiation and before inversion.

To verify this, I have run a one-to-one comparison between the outcome of the method and MATLAB's builtin LDA classifier on the Iris dataset. Only after removal of line 518, results match (up to a tolerance).

If everyone agrees on that, I am happy to submit a PR.

@amueller amueller added the Bug label Jul 15, 2016

@amueller amueller added this to the 0.18 milestone Jul 15, 2016

@ghost

This comment has been minimized.

Copy link

ghost commented Jul 17, 2016

Hi Agamemnon,

I am currently looking at this issue. I agree with you that there is a bug but I do not think it is related to adding the constant 1. I think that the code for LDA works fine for dataset with two classes but it uses an incorrect formula for three or more classes. Do you mind sharing your MATLAB's results for the Iris dataset? A Gist would be great.

Thanks.

@agamemnonc

This comment has been minimized.

Copy link
Contributor Author

agamemnonc commented Jul 18, 2016

MATLAB snippet (version R2016a)
https://gist.github.com/agamemnonc/4c9ddb472cd7c3ac84718cee1d2f6ce6

Output:
0.0000 0.9993 0.0007

Python/sklearn snippet (sklearn version 0.17.1)
https://gist.github.com/agamemnonc/0d668ae2552188b73958be7961a7e9e3

Output (currently)
[ 0.33333333 0.33333333 0.33333333]

Output (when line in question removed)
[ 4.69392059e-20 9.99390132e-01 6.09867606e-04]

@ghost

This comment has been minimized.

Copy link

ghost commented Jul 18, 2016

Great. I appreciate your info. This is very helpful. Can you send me your output for a couple of more values? In particular, 23 and 101? Thanks a lot.

@agamemnonc

This comment has been minimized.

Copy link
Contributor Author

agamemnonc commented Jul 19, 2016

Sure, no problem.

Data point 24 (please note MATLAB uses 1-based indexing, so this is 23 in Python)
[1.0000 0.0000 0.0000]

Data point 102 (101 in Python)
[0.0000 0.0011 0.9989]

From a theoretical point of view, I still believe that the line prob += 1 is the one causing the trouble. decision_function(x) returns the log-probability up to an additive constant. Then, we only need to exponentiate to get the unnormalized probability distribution (i.e probability distribution up to a multiplicative constant due to the exponentiation).

prob *= -1 followed by np.reciprocal(prob, prob) is used to avoid overflow issues I believe.

In mathematical terms.

log (p(y=k|x)) = δ_k(x)
-log (p(y=k|x)) = - δ_k(x)
e ^ (-log (p(y=k|x)) ) = e ^ (- δ_k(x))
e ^ log(1/(p(y=k|x))) = e ^ (- δ_k(x))
1/(p(y=k|x)) = e ^ (- δ_k(x))
p(y=k|x) = 1 / ( e ^ (- δ_k(x))) (unnormalized)

As you can hopefully see in the above formulae adding a constant to the posterior probability is not required and indeed it affects the results.

@ghost

This comment has been minimized.

Copy link

ghost commented Jul 21, 2016

Hi, thanks again.

In your notation, what do you mean by π and δ? Isn't π the ratio of the two posterior probabilities? If so, where do you assume its dependency with respect to the classes?
Have you looked at the documentation?

I do have a solution which does not involve removing +1 but modifying the formula per the case of more than three classes. I am busy right now but I will work on it as soon as I have time. Thanks.

@agamemnonc

This comment has been minimized.

Copy link
Contributor Author

agamemnonc commented Jul 25, 2016

Sorry, I should have made it more explicit. I follow the notation of the textbook "The elements of statistical learning" by Hastie, Tibshirani and Friedman which is freely available for download (pdf) here
https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

δ (it should be δ_k) is the discriminant function for class k, i.e. the element of the vector returned by the decision_function() method which corresponds to class k. Please refer to 4.10 (p 109) of the textbook.

By π_k I meant the posterior probability for class k (it is not a fraction). However in the textbook π_k is used to denote the prior probability so apologies about the confusion. From now on, I denote the prior with π_k and the posterior p(y=k|x). (I' ve also edited my comment above to reflect this).

In the multiclass case the posterior probabilities are given by

p(y=k|x) = e ^ {δ_k(x)} / sum(e ^ {δ_k(x)}).

In the binary classification case, you can replace p(y=1|x) with 1 - p(y=0|x) and then substitute in the fraction (which only makes sense in the binary case) (eq 4.9 of textbook) and I suspect that this is where the +1 comes from. But in the multiclass case we can't do that and I think since it generalises the binary case it is more elegant.

@ogrisel

This comment has been minimized.

Copy link
Member

ogrisel commented Sep 6, 2016

The += 1 (alone) is not the problem. If you remove that line the tests fail with negative probability values...

The problem likely stems from the implementation of the decision_function from LinearClassifierMixin (which depends on the standard coef_ and intercept_ attributes) which is probably not directly equivalent to the traditional LDA parameterization of the log likelihood ratio of the positive class w.r.t. another fixed class. More work is needed to workout the correct normalization while keeping the LinearClassifierMixin parameterization which is useful to get LDA models to behave consistently like other sklearn linear models.

@amueller amueller modified the milestones: 0.18, 0.19 Sep 22, 2016

@amueller amueller modified the milestone: 0.19 Sep 29, 2016

@jnothman jnothman modified the milestones: 0.20, 0.19 Jun 13, 2017

@jnothman jnothman changed the title LDA predict probability bug LinearDiscriminantAnalysis predict probability bug Jun 14, 2017

@glemaitre glemaitre modified the milestones: 0.20, 0.21 Jun 13, 2018

@xavierbourretsicotte

This comment has been minimized.

Copy link

xavierbourretsicotte commented Jun 18, 2018

Hi guys, is this bug still open? I confirm that I get bizarre results using LDA with 3 classes on the IRIS dataset (X1 (horizontal axis) = sepal_length, X2 (vertical axis) = sepal_width)

In particular, the covariance matrix seems to be off: for example the covariance matrix given by lda.covariance_ = array([[0.259708 , 0.09086667], [0.09086667, 0.11308 ]])

and the covariance given by covariance.empirical_covariance(X) =
array([[ 0.68112222, -0.04215111], [-0.04215111, 0.18871289]])

Which is what I get manually and with pandas.cov.

Since the covariance is off, the gaussian estimate is also off and hence the predictions ? All the best

@agramfort

This comment has been minimized.

Copy link
Member

agramfort commented Jun 18, 2018

@agamemnonc

This comment has been minimized.

Copy link
Contributor Author

agamemnonc commented Aug 11, 2018

I have now submitted a PR which I think solves the issue: #11796

Do we perhaps want to include additional tests checking the output of predict_proba for LDA and QDA both for the binary and multi-class cases?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.