LinearDiscriminantAnalysis predict probability bug #6848
Comments
Hi Agamemnon, I am currently looking at this issue. I agree with you that there is a bug, but I do not think it is related to adding the constant 1. I think the code for LDA works fine for datasets with two classes, but it uses an incorrect formula for three or more classes. Do you mind sharing your MATLAB results for the Iris dataset? A Gist would be great. Thanks.
MATLAB snippet (version R2016a) and its output; Python/sklearn snippet (sklearn version 0.17.1) with its output (currently) and its output (when the line in question is removed).
Great, I appreciate your info; this is very helpful. Can you send me your output for a couple more values? In particular, 23 and 101? Thanks a lot.
Sure, no problem. Data point 24 (please note MATLAB uses 1-based indexing, so this is 23 in Python). Data point 102 (101 in Python). From a theoretical point of view, I still believe that the line `prob += 1` is the one causing the trouble. `decision_function(x)` returns the log-probability up to an additive constant, so we only need to exponentiate to get the unnormalized probability distribution (i.e. the probability distribution up to a multiplicative constant introduced by the exponentiation). `prob *= -1` followed by `np.reciprocal(prob, prob)` is used to avoid overflow issues, I believe. In mathematical terms: log p(y=k|x) = δ_k(x) + const. As you can hopefully see from the above formula, adding a constant to the posterior probability is not required, and indeed it affects the results.
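The additive-constant argument above can be checked numerically: shifting all log-probabilities by the same constant leaves the normalized distribution unchanged. A small sketch with made-up values (not taken from the Iris runs discussed here):

```python
import numpy as np

# made-up log-probabilities, known only up to an additive constant
d = np.array([1.0, 2.0, 3.0])

p1 = np.exp(d)
p1 /= p1.sum()

p2 = np.exp(d + 5.0)  # same values shifted by an arbitrary constant
p2 /= p2.sum()

# the constant becomes a multiplicative factor e^5 after exponentiation
# and cancels in the normalization: p1 and p2 are identical
assert np.allclose(p1, p2)
```

This is why exponentiating and renormalizing is all that is needed; no extra constant has to be injected along the way.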
Hi, thanks again. In your notation, what do you mean by π and δ? Isn't π the ratio of the two posterior probabilities? If so, where do you assume its dependency on the classes? I do have a solution which does not involve removing the +1 but instead modifies the formula for the case of three or more classes. I am busy right now, but I will work on it as soon as I have time. Thanks.
Sorry, I should have made it more explicit. I follow the notation of the textbook "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman, which is freely available for download as a PDF. δ (it should be δ_k) is the discriminant function for class k, i.e. the element of the vector returned by the `decision_function()` method which corresponds to class k. Please refer to eq. 4.10 (p. 109) of the textbook. By π_k I meant the posterior probability for class k (it is not a fraction). However, in the textbook π_k is used to denote the prior probability, so apologies for the confusion. From now on, I denote the prior with π_k and the posterior with p(y=k|x). (I've also edited my comment above to reflect this.) In the multiclass case the posterior probabilities are given by p(y=k|x) = e^{δ_k(x)} / Σ_l e^{δ_l(x)}. In the binary case, you can replace p(y=1|x) with 1 − p(y=0|x) and substitute into the ratio of the two posteriors (which only makes sense in the binary case; eq. 4.9 of the textbook), and I suspect that this is where the +1 comes from. But in the multiclass case we can't do that, and since the multiclass formula generalises the binary case, I think it is more elegant.
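The multiclass formula above is just a softmax over the decision values. A minimal sketch of that computation (an illustration of the formula, not scikit-learn's actual implementation):

```python
import numpy as np

def softmax_posterior(decision_values):
    """p(y=k|x) = exp(d_k) / sum_l exp(d_l), computed row-wise.

    decision_values: array of shape (n_samples, n_classes) holding d_k(x).
    """
    # subtracting the per-row maximum avoids overflow; being an additive
    # constant per row, it cancels after normalization
    d = decision_values - decision_values.max(axis=1, keepdims=True)
    e = np.exp(d)
    return e / e.sum(axis=1, keepdims=True)
```

Each output row sums to 1, and in the two-class case this reduces to the familiar sigmoid of the difference of the two decision values.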
The `+= 1` (alone) is not the problem. If you remove that line the tests fail with negative probability values... The problem likely stems from the implementation of the
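For reference, the binary derivation is the one place where a `+ 1` is legitimate: with a single decision value d(x) = δ_1(x) − δ_0(x), the posterior reduces to a sigmoid. A sketch of that two-class form (reconstructed from the discussion above, not the library code):

```python
import numpy as np

def binary_posterior(d):
    # p(y=1|x) = e^{d1} / (e^{d0} + e^{d1}) = 1 / (1 + e^{-d})
    # with d = d1 - d0; the "+ 1" only appears in this two-class form
    prob = np.exp(-np.asarray(d, dtype=float))
    prob += 1.0
    np.reciprocal(prob, prob)
    return prob
```

Applying the same exp / add 1 / invert sequence per class in the multiclass case has no such justification, which is consistent with the disagreement reported against MATLAB.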
Hi guys, is this bug still open? I confirm that I get bizarre results using LDA with 3 classes on the Iris dataset (X1 (horizontal axis) = sepal_length, X2 (vertical axis) = sepal_width). In particular, the covariance matrix seems to be off: for example, `lda.covariance_` gives array([[0.259708, 0.09086667], [0.09086667, 0.11308]]), which differs from the covariance given by `covariance.empirical_covariance(X)`, which is what I get manually and with `pandas.cov`. Since the covariance is off, the Gaussian estimate is also off, and hence the predictions? All the best
The covariance in LDA is computed after centering each sample with its class mean, so it's not the covariance of the full X. OK?
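To make the point above concrete, here is a rough sketch of a pooled within-class covariance (each class centered by its own mean). The function name and the biased normalization are illustrative assumptions, not scikit-learn's exact code, but they show why this quantity differs from `empirical_covariance(X)`:

```python
import numpy as np

def pooled_within_class_cov(X, y):
    # center each class by its own mean, then take the (biased)
    # covariance of the centered data; this is NOT the covariance
    # of the raw X whenever the class means differ
    Xc = np.empty_like(X, dtype=float)
    for k in np.unique(y):
        mask = (y == k)
        Xc[mask] = X[mask] - X[mask].mean(axis=0)
    return Xc.T @ Xc / X.shape[0]
```

On data whose class means are far apart, this differs sharply from the covariance of the raw X, which is exactly the discrepancy reported above.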
I have now submitted a PR which I think solves the issue: #11796. Do we perhaps want to include additional tests checking the output of
I am pretty confident there is a bug introduced in commit 7c1101d. Concretely, line 518 of the current version of
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py
should be removed, as it yields wrong results. There is no reason why the constant 1 should be added to the computed probability after exponentiation and before inversion. To verify this, I ran a one-to-one comparison between the output of the method and MATLAB's built-in LDA classifier on the Iris dataset. Only after removing line 518 do the results match (up to a tolerance).
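As a sanity check, one can compare a computation resembling the code in question (a per-class sigmoid followed by renormalization, as reconstructed from this discussion) against a plain softmax of the same decision values. Note that without the `+ 1`, the first path reduces to exactly the softmax, since 1/e^{-d} = e^{d}. The decision values below are made up for illustration:

```python
import numpy as np

# made-up decision values for two samples, three classes
d = np.array([[2.0, 1.0, 0.1],
              [0.5, 0.5, 3.0]])

# path resembling the reported code: per-class sigmoid, then renormalize
suspect = 1.0 / (1.0 + np.exp(-d))
suspect /= suspect.sum(axis=1, keepdims=True)

# direct softmax of the decision values
e = np.exp(d - d.max(axis=1, keepdims=True))
softmax = e / e.sum(axis=1, keepdims=True)

# both are valid probability distributions, but they disagree
# whenever there are three or more classes
assert np.allclose(suspect.sum(axis=1), 1.0)
assert np.allclose(softmax.sum(axis=1), 1.0)
assert not np.allclose(suspect, softmax)
```

The disagreement between the two paths for three classes matches the mismatch observed against MATLAB on Iris.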
If everyone agrees on that, I am happy to submit a PR.