
Avoiding zeros in NMF β loss calculation causes inconsistency #25438

Closed
yotamcons opened this issue Jan 19, 2023 · 9 comments

@yotamcons
Contributor

yotamcons commented Jan 19, 2023

WH_data[WH_data == 0] = EPSILON

The line WH_data[WH_data == 0] = EPSILON, used as part of the β loss calculation, creates a situation where some entries in the data that were larger than others (say 1e-09, which was larger than zero) end up smaller than them after the clamping.
I think the code should be corrected to
WH_data[WH_data < EPSILON] = EPSILON
to avoid such inconsistencies.
Another issue is preventing overflow in the division that happens a few lines later: if the data contains the smallest positive float (around 1e-38), dividing a number larger than 1 by it can cause an overflow.
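
To make both points concrete, here is a minimal sketch (not the scikit-learn code itself, just a hand-made array together with the module's EPSILON constant):

import numpy as np

EPSILON = np.finfo(np.float32).eps  # ~1.19e-07, the constant defined in sklearn/decomposition/_nmf.py

WH_data = np.array([0.0, 1e-9, 0.5])

# Current behaviour: only exact zeros are clamped.
a = WH_data.copy()
a[a == 0] = EPSILON
# a == [1.19e-07, 1e-09, 0.5]: the entry that used to be the smallest (the zero)
# is now larger than the 1e-09 entry, so the original ordering is inverted.

# Proposed behaviour: clamp everything below EPSILON.
b = WH_data.copy()
b[b < EPSILON] = EPSILON
# b == [1.19e-07, 1.19e-07, 0.5]: ties instead of an inverted ordering.

# Overflow concern: in float32, dividing by the smallest positive normal
# number (~1.18e-38) overflows to inf once the numerator exceeds roughly 4.
np.float32(2.0) / np.finfo(np.float32).tiny  # ~1.7e38, still finite
np.float32(5.0) / np.finfo(np.float32).tiny  # inf, with an overflow RuntimeWarning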

@github-actions bot added the Needs Triage label Jan 19, 2023
@yotamcons
Contributor Author

yotamcons commented Jan 19, 2023

Searching the file for additional occurrences, a similar situation appears in 9 other places throughout this file, mostly in the context of the β loss calculation, but also more generally to avoid division by zero.
I don't know what scikit-learn's general policy is regarding adding a small constant to avoid division by zero, but I suggest maybe working with finfo.smallest_normal, or something between that and the defined ε (say the geometric mean of the two, or the square root of smallest_normal). See here for details.
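
For reference, a quick sketch of the candidate constants mentioned above, assuming float32 (the dtype the current EPSILON is based on); smallest_normal is the NumPy >= 1.22 alias for finfo.tiny:

import numpy as np

eps = np.finfo(np.float32).eps               # ~1.19e-07, current EPSILON in _nmf.py
tiny = np.finfo(np.float32).smallest_normal  # ~1.18e-38, smallest positive normal float32

print(np.sqrt(tiny))        # ~1.08e-19, square root of smallest_normal
print(np.sqrt(eps * tiny))  # ~3.7e-23, geometric mean of the two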

@TomDLT
Member

TomDLT commented Jan 24, 2023

Thanks for the report. Changing to WH_data[WH_data < EPSILON] = EPSILON seems reasonable, most likely more robust than WH_data[WH_data == 0] = EPSILON. Do you want to submit a pull request?

Here, EPSILON = np.finfo(np.float32).eps, which is quite standard in scikit-learn. What is the reason you suggest working with finfo.smallest_normal instead?

@yotamcons
Contributor Author

I can submit a pull request; are there any specifics I need to know for the process?

Regarding the size of ε: specifically in NMF, the WH matrix is the product of the two matrices W and H, and it may be that values in the ε range of either one are important; when they are multiplied, the result will be in the ε² range. I don't think it's critical, though.
A more important note is that the behavior in other decomposition modules is slightly different:
_lda.py uses EPS = np.finfo(float).eps, which is system dependent,
_pca.py uses eps = 1e-15, which is not far from np.finfo(np.float64).eps,
test_nmf.py uses np.finfo(np.float64).eps,
and _nmf.py uses np.finfo(np.float32).eps.
Ideally the choice should depend on the datatype of X and WH itself, but since EPSILON is a module-level constant I would suggest going with the decision of the LDA implementation and using np.finfo(float).eps.
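
For comparison, the approximate values of those four constants (np.finfo(float) resolves to float64 on standard CPython builds):

import numpy as np

print(np.finfo(float).eps)       # _lda.py     -> ~2.22e-16
print(1e-15)                     # _pca.py     -> 1e-15
print(np.finfo(np.float64).eps)  # test_nmf.py -> ~2.22e-16
print(np.finfo(np.float32).eps)  # _nmf.py     -> ~1.19e-07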

@glevv
Contributor

glevv commented Jan 28, 2023

Shouldn't WH_data += EPSILON be more mathematically correct?

@yotamcons
Copy link
Contributor Author

Great question @glevv, I'm not qualified to address it. I can say that, strictly speaking, if the support of one distribution doesn't match that of the other (i.e. one has a zero where the other is non-zero), their KL divergence should be infinity.
This brings to mind another consideration: if the number of entries in X is on the order of 1/ε (which would be the case now if X has 10k samples and 1k features), adding ε to all of the cells could throw the calculation for a distribution matrix (sum of all entries == 1) way off. I think this is another justification for lowering ε to the 1e-15 range.
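
A back-of-the-envelope sketch of that concern, assuming X is normalized so that its entries sum to 1 and EPSILON keeps its current float32-based value:

import numpy as np

EPSILON = np.finfo(np.float32).eps  # ~1.19e-07
n_entries = 10_000 * 1_000          # 10k samples x 1k features

# Adding EPSILON to every cell inflates the total mass of a matrix that
# originally summed to 1 by roughly n_entries * EPSILON.
print(1 + n_entries * EPSILON)  # ~2.19: the "distribution" is now far from summing to 1
print(1 + n_entries * 1e-15)    # ~1.00000001 with the smaller epsilon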

@glevv
Contributor

glevv commented Feb 2, 2023

It's just a matter of reweighting the probabilities:

import numpy as np
from scipy.special import kl_div

probs = np.asarray([0., 0., 0.9, 0.1] * 1000) / 1000
probs_y = np.asarray([0.1, 0.1, 0.7, 0.1] * 1000) / 1000
EPSILON = 1e-7
print(kl_div(probs, probs_y).sum())  # straightforward
print(kl_div(probs * (1 - EPSILON) + EPSILON / probs.shape[0], probs_y).sum())  # reweighted
print(kl_div(np.where(probs < EPSILON, EPSILON, probs), probs_y).sum())  # proposed

0.22618298545281545
0.22618215902713518
0.224601434397019

A higher epsilon (like 1e-3) will give a very high discrepancy between the proposed and straightforward methods, while the reweighted one will still be close enough (agreeing to the third digit after the decimal point). With a lower epsilon (like 1e-15), all the methods "converge" to the same value.

If we reverse the arguments (kl_div(probs_y, probs)) we get

inf
2.864440919575757
1.2056309559997935

And the second method will give a higher number (but not infinity) than the third with any epsilon, which is what is needed.
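
For completeness, these are presumably the reversed calls behind those three numbers (same setup as the snippet above):

print(kl_div(probs_y, probs).sum())  # inf: probs has zeros where probs_y is non-zero
print(kl_div(probs_y, probs * (1 - EPSILON) + EPSILON / probs.shape[0]).sum())  # reweighted
print(kl_div(probs_y, np.where(probs < EPSILON, EPSILON, probs)).sum())  # proposed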

@yotamcons
Contributor Author

Alright, so how do you suggest I change the pull request (one of, or a combination of, the following)?
Option 1: just adjust EPSILON = 1e-15.
Option 2: change WH_data[WH_data < EPSILON] = EPSILON to WH_data += EPSILON.
Option 3: readjust inside each location: eps = EPSILON / WH_data.shape[0].

@glevv
Contributor

glevv commented Feb 2, 2023

I think it's up to the core team to decide how to move forward with this issue.

@glevv
Contributor

glevv commented Sep 18, 2023

Should this one be closed?

@TomDLT TomDLT closed this as completed Sep 18, 2023