
Avoiding zeros in NMF β loss calculation causes inconsistency #25438

Closed
yotamcons opened this issue Jan 19, 2023 · 9 comments

@yotamcons
Contributor

yotamcons commented Jan 19, 2023

WH_data[WH_data == 0] = EPSILON

The line WH_data[WH_data == 0] = EPSILON, used as part of the β loss calculation, creates a situation where some entries in the data that were larger than others (say 1e-09, which was larger than zero) end up smaller than them after the clamping.
I think the code should be corrected to
WH_data[WH_data < EPSILON] = EPSILON
to avoid such inconsistencies.
Another issue is preventing overflow in the division that happens a few lines later: if the data contains the smallest positive float (around 1e-38), dividing a number larger than 1 by it can cause an overflow.
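
To make both points concrete, here is a minimal sketch (not the scikit-learn code itself, just a hand-made array together with the module's EPSILON constant):

import numpy as np

EPSILON = np.finfo(np.float32).eps  # ~1.19e-07, the constant defined in sklearn/decomposition/_nmf.py

WH_data = np.array([0.0, 1e-9, 0.5])

# Current behaviour: only exact zeros are clamped.
a = WH_data.copy()
a[a == 0] = EPSILON
# a == [1.19e-07, 1e-09, 0.5]: the entry that used to be the smallest (the zero)
# is now larger than the 1e-09 entry, so the original ordering is inverted.

# Proposed behaviour: clamp everything below EPSILON.
b = WH_data.copy()
b[b < EPSILON] = EPSILON
# b == [1.19e-07, 1.19e-07, 0.5]: ties instead of an inverted ordering.

# Overflow concern: in float32, dividing by the smallest positive normal
# number (~1.18e-38) overflows to inf once the numerator exceeds roughly 4.
np.float32(2.0) / np.finfo(np.float32).tiny  # ~1.7e38, still finite
np.float32(5.0) / np.finfo(np.float32).tiny  # inf, with an overflow RuntimeWarning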

@github-actions bot added the Needs Triage label Jan 19, 2023
@yotamcons
Contributor Author

yotamcons commented Jan 19, 2023

Searching the file for additional occurrences, a similar situation appears in 9 other places throughout this file, mostly in the context of the β loss calculation, but also more generally to avoid division by zero.
I don't know what scikit-learn's general policy is regarding adding a small constant to avoid division by zero, but I suggest maybe working with finfo.smallest_normal, or something between that and the defined ε (say the geometric mean of the two, or the square root of smallest_normal). See here for details.
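
For reference, a quick sketch of the candidate constants mentioned above, assuming float32 (the dtype the current EPSILON is based on); smallest_normal is the NumPy >= 1.22 alias for finfo.tiny:

import numpy as np

eps = np.finfo(np.float32).eps               # ~1.19e-07, current EPSILON in _nmf.py
tiny = np.finfo(np.float32).smallest_normal  # ~1.18e-38, smallest positive normal float32

print(np.sqrt(tiny))        # ~1.08e-19, square root of smallest_normal
print(np.sqrt(eps * tiny))  # ~3.7e-23, geometric mean of the two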

@TomDLT
Member

TomDLT commented Jan 24, 2023

Thanks for the report. Changing to WH_data[WH_data < EPSILON] = EPSILON seems reasonable, most likely more robust than WH_data[WH_data == 0] = EPSILON. Do you want to submit a pull request?

Here, EPSILON = np.finfo(np.float32).eps, which is quite standard in scikit-learn. What is the reason you suggest working with finfo.smallest_normal instead?

@yotamcons
Contributor Author

I can submit a pull request; are there any specifics I need to know for the process?

Regarding the size of ε: specifically in NMF, the WH matrix is the product of the two matrices W and H, and it may be that values in the ε range of either one are important; when they are multiplied, the result will be in the ε² range. I don't think it's critical, though.
A more important note is that the behavior in other decomposition modules is slightly different:
_lda.py uses EPS = np.finfo(float).eps, which is system dependent,
_pca.py uses eps = 1e-15, which is not far from np.finfo(np.float64).eps,
test_nmf.py uses np.finfo(np.float64).eps,
and _nmf.py uses np.finfo(np.float32).eps.
Ideally the choice should depend on the datatype of X and WH itself, but since EPSILON is a module-level constant I would suggest going with the decision of the LDA implementation and using np.finfo(float).eps.
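
For comparison, the approximate values of those four constants (np.finfo(float) resolves to float64 on standard CPython builds):

import numpy as np

print(np.finfo(float).eps)       # _lda.py     -> ~2.22e-16
print(1e-15)                     # _pca.py     -> 1e-15
print(np.finfo(np.float64).eps)  # test_nmf.py -> ~2.22e-16
print(np.finfo(np.float32).eps)  # _nmf.py     -> ~1.19e-07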

@glevv
Contributor

glevv commented Jan 28, 2023

Shouldn't WH_data += EPSILON be more mathematically correct?

@yotamcons
Copy link
Contributor Author

Great question @glevv, I'm not qualified to address it. I can say that, strictly speaking, if the support of one distribution doesn't match that of the other (i.e. one has a zero where the other is non-zero), their KL divergence should be infinity.
This brings to mind another consideration: if the number of entries in X is on the order of 1/ε (which would be the case now if X has 10k samples and 1k features), adding ε to all of the cells could throw the calculation for a distribution matrix (sum of all entries == 1) way off. I think this is another justification for lowering ε to the 1e-15 range.
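
A back-of-the-envelope sketch of that concern, assuming X is normalized so that its entries sum to 1 and EPSILON keeps its current float32-based value:

import numpy as np

EPSILON = np.finfo(np.float32).eps  # ~1.19e-07
n_entries = 10_000 * 1_000          # 10k samples x 1k features

# Adding EPSILON to every cell inflates the total mass of a matrix that
# originally summed to 1 by roughly n_entries * EPSILON.
print(1 + n_entries * EPSILON)  # ~2.19: the "distribution" is now far from summing to 1
print(1 + n_entries * 1e-15)    # ~1.00000001 with the smaller epsilon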

@glevv
Contributor

glevv commented Feb 2, 2023

It's just a matter of reweighting the probabilities:

import numpy as np
from scipy.special import kl_div

probs = np.asarray([0., 0., 0.9, 0.1] * 1000) / 1000
probs_y = np.asarray([0.1, 0.1, 0.7, 0.1] * 1000) / 1000
EPSILON = 1e-7
print(kl_div(probs, probs_y).sum())  # straightforward
print(kl_div(probs * (1 - EPSILON) + EPSILON / probs.shape[0], probs_y).sum())  # reweighted
print(kl_div(np.where(probs < EPSILON, EPSILON, probs), probs_y).sum())  # proposed

0.22618298545281545
0.22618215902713518
0.224601434397019

A higher epsilon (like 1e-3) will give a very high discrepancy between the proposed and straightforward methods, while the reweighted one will still be close enough (agreeing to the third digit after the decimal point). With a lower epsilon (like 1e-15), all the methods "converge" to the same value.

If we reverse the arguments (kl_div(probs_y, probs)) we get

inf
2.864440919575757
1.2056309559997935

And the second method will give a higher number (but not infinity) than the third with any epsilon, which is what is needed.
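
For completeness, these are presumably the reversed calls behind those three numbers (same setup as the snippet above):

print(kl_div(probs_y, probs).sum())  # inf: probs has zeros where probs_y is non-zero
print(kl_div(probs_y, probs * (1 - EPSILON) + EPSILON / probs.shape[0]).sum())  # reweighted
print(kl_div(probs_y, np.where(probs < EPSILON, EPSILON, probs)).sum())  # proposed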

@yotamcons
Contributor Author

Alright, so how do you suggest I change the pull request (one of, or a combination of, the following)?
Option 1: just adjust EPSILON = 1e-15.
Option 2: change WH_data[WH_data < EPSILON] = EPSILON to WH_data += EPSILON.
Option 3: readjust inside each location: eps = EPSILON / WH_data.shape[0].

@glevv
Contributor

glevv commented Feb 2, 2023

I think it's up to the core team to decide how to move forward with this issue.

@glevv
Contributor

glevv commented Sep 18, 2023

Should this one be closed?

@TomDLT TomDLT closed this as completed Sep 18, 2023